[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784719#comment-15784719 ]

Bibin A Chundatt commented on YARN-6031:
----------------------------------------

[~templedf]
{quote}
so that when using -remove-application-from-state-store you know what you're purging.
{quote}
Another concern about removing an application from the state store is that an already-running AM would be left in a dormant state. If we instead allow the application to recover and make sure it is killed from the scheduler, the application gets killed with a reason. Currently, on the MR side, if the AM is running and requests an unavailable label, the MR AM kills the application; the same could be implemented in all applications.

> Application recovery failed after disabling node label
> ------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>         Attachments: YARN-6031.001.patch
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a node label expression specified while node labels had been disabled.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
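The recover-then-kill-with-a-reason behavior discussed above can be sketched with a toy validator; all class and method names below are illustrative, not the actual YARN code:

```java
// Hypothetical sketch: tolerate a label expression on the recovery path
// (so RM start-up does not fail wholesale) while still rejecting it for
// fresh submissions. Names are illustrative, not the YARN API.
class LabelValidationSketch {

    static class InvalidLabelException extends RuntimeException {
        InvalidLabelException(String msg) { super(msg); }
    }

    /**
     * Returns true if the label expression is usable as-is, false if it
     * was ignored (recovery path); throws for a fresh submission that
     * carries a label while node labels are disabled.
     */
    static boolean validateLabel(String labelExpr, boolean labelsEnabled,
                                 boolean isRecovery) {
        if (labelExpr == null || labelsEnabled) {
            return true;                   // nothing to reject
        }
        if (isRecovery) {
            // Let the app recover; the scheduler can later kill it with
            // an explicit reason, as suggested for the MR AM case.
            return false;
        }
        throw new InvalidLabelException(
            "node label not enabled but request contains label expression");
    }

    public static void main(String[] args) {
        // Recovery path: the label is ignored instead of failing RM start.
        System.out.println(validateLabel("x", false, true));   // false
        // Fresh submission with a label while labels are off: rejected.
        try {
            validateLabel("x", false, false);
        } catch (InvalidLabelException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only to separate the two call sites: the same check that hard-fails a new submission becomes a soft warning during recovery, so the admin still sees the misconfiguration but the RM keeps starting.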
[jira] [Commented] (YARN-5709) Cleanup leader election configs and pluggability
[ https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784442#comment-15784442 ]

Junping Du commented on YARN-5709:
----------------------------------

Interesting... why do these javadoc warnings show up only against JDK 1.8?

> Cleanup leader election configs and pluggability
> ------------------------------------------------
>
>                 Key: YARN-5709
>                 URL: https://issues.apache.org/jira/browse/YARN-5709
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>             Fix For: 2.9.0, 3.0.0-alpha2
>
>         Attachments: yarn-5709-branch-2.8.01.patch, yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch
>
> While reviewing YARN-5677 and YARN-5694, I noticed we could make the curator-based election code cleaner. It would be nicer to get this fixed in 2.8 before we ship it, but this can be done at a later time as well.
> # By EmbeddedElector, we meant it was running as part of the RM daemon. Since the Curator-based elector also runs embedded, I feel the code should be checking for {{!curatorBased}} instead of {{isEmbeddedElector}}
> # {{LeaderElectorService}} should probably be named {{CuratorBasedEmbeddedElectorService}} or some such.
> # The code that initializes the elector should be in the same place irrespective of whether it is curator-based or not.
> # We seem to be caching the CuratorFramework instance in RM. It makes more sense for it to be in RMContext. If others are okay with it, we might even be better off having a {{RMContext#getCurator()}} method to lazily create the curator framework and then cache it.
[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java
[ https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784417#comment-15784417 ]

zhangshilong commented on YARN-4090:
------------------------------------

Would you please tell me which YARN version you used? In trunk, FairScheduler.getQueueUserAclInfo does not lock the FSQueue object; the FSQueue object is locked only in decResourceUsage or incrResourceUsage.

FairScheduler:
{code:java}
@Override
public List<QueueUserACLInfo> getQueueUserAclInfo() {
  UserGroupInformation user;
  try {
    user = UserGroupInformation.getCurrentUser();
  } catch (IOException ioe) {
    return new ArrayList<QueueUserACLInfo>();
  }
  return queueMgr.getRootQueue().getQueueUserAclInfo(user);
}
{code}
FSParentQueue.java:
{code:java}
@Override
public List<QueueUserACLInfo> getQueueUserAclInfo(UserGroupInformation user) {
  List<QueueUserACLInfo> userAcls = new ArrayList<>();

  // Add queue acls
  userAcls.add(getUserAclInfo(user));

  // Add children queue acls
  readLock.lock();
  try {
    for (FSQueue child : childQueues) {
      userAcls.addAll(child.getQueueUserAclInfo(user));
    }
  } finally {
    readLock.unlock();
  }
  return userAcls;
}
{code}

> Make Collections.sort() more efficient in FSParentQueue.java
> ------------------------------------------------------------
>
>                 Key: YARN-4090
>                 URL: https://issues.apache.org/jira/browse/YARN-4090
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: fairscheduler
>            Reporter: Xianyin Xin
>            Assignee: Xianyin Xin
>         Attachments: YARN-4090-TestResult.pdf, YARN-4090-preview.patch, YARN-4090.001.patch, YARN-4090.002.patch, YARN-4090.003.patch, sampling1.jpg, sampling2.jpg
>
> Collections.sort() consumes too much time in a scheduling round.
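The locking pattern in the FSParentQueue snippet above (child traversal under a read lock instead of the queue monitor) can be sketched in isolation; the class below is illustrative, not the YARN code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal, self-contained sketch of the pattern: mutation takes the write
// lock, traversal takes the read lock, so many reader threads can collect
// ACL info concurrently without serializing on the object's monitor.
class QueueSketch {
    private final List<String> childAcls = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void addChildAcl(String acl) {
        lock.writeLock().lock();
        try {
            childAcls.add(acl);      // exclusive: writers block all readers
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> getQueueUserAclInfo() {
        List<String> acls = new ArrayList<>();
        lock.readLock().lock();      // shared: readers don't block each other
        try {
            acls.addAll(childAcls);  // copy out, release lock before returning
        } finally {
            lock.readLock().unlock();
        }
        return acls;
    }
}
```

This is why the trunk code quoted above is not implicated in the monitor-based contention the question asks about: read-only traversals take only the shared read lock.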
[jira] [Commented] (YARN-4090) Make Collections.sort() more efficient in FSParentQueue.java
[ https://issues.apache.org/jira/browse/YARN-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784392#comment-15784392 ]

zhangshilong commented on YARN-4090:
------------------------------------

[~xinxianyin] [~yufeigu] This optimization works very well in our environment; I hope to continue this issue.
[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784388#comment-15784388 ]

Tao Yang commented on YARN-6029:
--------------------------------

Thanks [~wangda]. Updated the priority to Critical and attached a new patch for review. This patch needs to add indentation inside the new synchronized blocks; the diff without whitespace changes looks like this:
{code}
   @Override
-  public synchronized CSAssignment assignContainers(Resource clusterResource,
+  public CSAssignment assignContainers(Resource clusterResource,
       FiCaSchedulerNode node, ResourceLimits currentResourceLimits,
       SchedulingMode schedulingMode) {
+    synchronized (this) {
     updateCurrentResourceLimits(currentResourceLimits, clusterResource);

     if (LOG.isDebugEnabled()) {
@@ -906,6 +907,7 @@
     }

     setPreemptionAllowed(currentResourceLimits, node.getPartition());
+    }

     // Check for reserved resources
     RMContainer reservedContainer = node.getReservedContainer();
@@ -923,6 +925,7 @@
       }
     }

+    synchronized (this) {
     // if our queue cannot access this node, just return
     if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY
         && !accessibleToPartition(node.getPartition())) {
@@ -1019,6 +1022,7 @@
       return CSAssignment.NULL_ASSIGNMENT;
     }
+  }
{code}

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to
> release a reserved container
> ---------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-6029.001.patch, YARN-6029.002.patch, deadlock.jstack
>
> When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls YarnClient#getQueueAclsInfo) just at the moment that LeafQueue#assignContainers is called, and before the parent queue is notified to release resource (it should release a reserved container), the ResourceManager can deadlock. I found this problem on our testing environment for Hadoop 2.8.
> Reproducing the deadlock in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized LeafQueue#assignContainers (takes the LeafQueue instance lock of queue root.a)
> * 2. Thread B (IPC Server handler) calls synchronized ParentQueue#getQueueUserAclInfo (takes the ParentQueue instance lock of queue root), iterates over children queue acls, and is blocked when calling synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being completed and is blocked when invoking the synchronized ParentQueue#internalReleaseResource method (the ParentQueue instance lock of queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be removed to solve this problem, since this method does not appear to touch fields of the LeafQueue instance.
> Attaching a patch with a UT for review.
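The lock cycle described in the issue, and how dropping the leaf-side monitor breaks it, can be sketched with toy classes (names are illustrative, not the CapacityScheduler code):

```java
import java.util.Collections;
import java.util.List;

// Toy model of the cycle: Thread A holds the leaf monitor and wants the
// parent monitor; Thread B holds the parent monitor and wants the leaf
// monitor. Making the read-only ACL lookup unsynchronized removes the
// parent -> leaf lock edge, so no cycle remains.
class ParentQueueSketch {
    final LeafQueueSketch leaf = new LeafQueueSketch(this);

    synchronized void internalReleaseResource() {
        // mutates parent-side bookkeeping; needs the parent monitor
    }

    synchronized List<String> getQueueUserAclInfo() {
        // After the fix this call no longer blocks on the leaf monitor.
        return leaf.getQueueUserAclInfo();
    }
}

class LeafQueueSketch {
    private final ParentQueueSketch parent;

    LeafQueueSketch(ParentQueueSketch parent) { this.parent = parent; }

    // NOT synchronized (the proposed fix): reads no mutable leaf state.
    List<String> getQueueUserAclInfo() {
        return Collections.singletonList("leaf-acl");
    }

    synchronized void assignContainers() {
        // Leaf monitor held; taking the parent monitor here is now the
        // only cross-object lock edge, so a cycle cannot form.
        parent.internalReleaseResource();
    }
}
```

With the original code, LeafQueueSketch#getQueueUserAclInfo would also be synchronized, giving the two lock orders leaf→parent and parent→leaf that the jstack in the attachment shows deadlocking.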
[jira] [Updated] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved co
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-6029:
---------------------------
    Attachment: YARN-6029.002.patch
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349 ]

Ying Zhang edited comment on YARN-6031 at 12/29/16 3:03 AM:
------------------------------------------------------------

{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error messages will be printed in the RM log:
{noformat}
2016-12-28 01:00:22,694 WARN resourcemanager.RMAppManager (RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission failed in validating AM resource request for application application_xx
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
        ... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager (RMAppManager.java:recover(455)) - Failed to recover application application_xx
{noformat}
The first error message is printed by the check that we fail at in the first place; the second is printed by the code in the patch. I'm thinking this would be enough of a hint for the root cause.

was (Author: ying zhang): (previous revision: the same log excerpt with the full application ID shown, and without the closing remark)
[jira] [Updated] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved co
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-6029:
---------------------------
    Priority: Critical  (was: Blocker)
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349 ]

Ying Zhang edited comment on YARN-6031 at 12/29/16 3:00 AM:
------------------------------------------------------------

{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error messages will be printed in the RM log:
{noformat}
2016-12-28 01:00:22,694 WARN resourcemanager.RMAppManager (RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission failed in validating AM resource request for application application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
        ... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager (RMAppManager.java:recover(455)) - Failed to recover application application_1482915192452_0001
{noformat}

was (Author: ying zhang): (previous revision: the same text without the {noformat} formatting)
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784349#comment-15784349 ]

Ying Zhang commented on YARN-6031:
----------------------------------

{quote}
Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels?
{quote}
Hi [~templedf], the following error messages will be printed in the RM log:

2016-12-28 01:00:22,694 WARN resourcemanager.RMAppManager (RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission failed in validating AM resource request for application application_1482915192452_0001
org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
        ... ...
2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager (RMAppManager.java:recover(455)) - Failed to recover application application_1482915192452_0001
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784332#comment-15784332 ]

Ying Zhang commented on YARN-6031:
----------------------------------

So what's the next move? I'm a little confused. Are we going to address this issue with the general approach proposed by [~templedf] and [~sunilg]?
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784327#comment-15784327 ] Ying Zhang commented on YARN-6031: -- {quote} We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. {quote} Agree with Daniel, resetting might not be a good idea. The admin should be aware of the failure and take proper action. {quote} IIUC ignore validation on recovery also should work. {quote} I was thinking the same in the first place (see my comment at YARN-4465). Then I came to agree with what [~sunilg] said, for the same reason as above: we should not hide the failure/wrong configuration. {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. {quote} Yes, you're right. I agree with the current approach; I just want to clarify so that everyone is on the same page :-)
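The three options debated above (reject, reset to the default label, or skip validation on recovery) can be sketched in a few lines. This is a hypothetical illustration, not the actual RMAppManager/SchedulerUtils code; the class and method names below are made up for the example, and only option (b) is wired in to show why it hides the misconfiguration.

```java
// Hypothetical sketch of the recovery-time label validation being discussed;
// not the real YARN API. LabelValidationSketch and validate() are invented
// names for illustration only.
public class LabelValidationSketch {
    static final String NO_LABEL = "";

    /**
     * Validate a requested node-label expression. A fresh submission that
     * carries a label while the feature is disabled is rejected outright;
     * on recovery, option (b) from the discussion silently resets the label,
     * which is exactly the behavior the commenters argue hides the failure
     * from the admin.
     */
    public static String validate(String labelExpression,
                                  boolean nodeLabelsEnabled,
                                  boolean isRecovery) {
        boolean hasLabel = labelExpression != null && !labelExpression.isEmpty();
        if (hasLabel && !nodeLabelsEnabled) {
            if (isRecovery) {
                // Option (b): reset instead of failing -- recovery succeeds,
                // but the admin never learns about the stale label.
                return NO_LABEL;
            }
            // The submission path: same failure as in the reported stack trace.
            throw new IllegalArgumentException(
                "Invalid resource request, node label not enabled but "
                + "request contains label expression");
        }
        return hasLabel ? labelExpression : NO_LABEL;
    }

    public static void main(String[] args) {
        // New submission with a label while labels are disabled: rejected.
        try {
            validate("gpu", false, false);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // Recovery path under option (b): the label is silently dropped.
        System.out.println("recovered label = '" + validate("gpu", false, true) + "'");
    }
}
```

The consensus in the thread is to keep the reject behavior on both paths, so the admin sees the failure and can re-enable labels or clean the state store deliberately.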
[jira] [Commented] (YARN-5685) Non-embedded HA failover is broken
[ https://issues.apache.org/jira/browse/YARN-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784191#comment-15784191 ] Karthik Kambatla commented on YARN-5685: On YARN-5709, Jian made the point that it is unlikely we will add a non-embedded leader elector, and we could always add a config then. I am fine with removing the config for embedded. > Non-embedded HA failover is broken > -- > > Key: YARN-5685 > URL: https://issues.apache.org/jira/browse/YARN-5685 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.0, 3.0.0-alpha1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Labels: oct16-hard > Attachments: YARN-5685.001.patch, YARN-5685.002.patch > > > If HA is enabled with automatic failover enabled and embedded failover > disabled, all RMs all come up in standby state. To make one of them active, > the {{--forcemanual}} flag must be used when manually triggering the state > change. Should the active go down, the standby will not become active and > must be manually transitioned with the {{--forcemanual}} flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5709) Cleanup leader election configs and pluggability
[ https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784177#comment-15784177 ] Hadoop QA commented on YARN-5709: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 28s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s{color} | {color:green} branch-2.8 passed with JDK v1.8.0_111 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 29s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 29s{color} | {color:green} branch-2.8 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 32s{color} | {color:green} branch-2.8 passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 26s{color} | {color:red} hadoop-yarn-server-resourcemanager in branch-2.8 failed 
with JDK v1.8.0_111. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_121 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s{color} | {color:green} the patch passed with JDK v1.8.0_111 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 35s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 41s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 4 new + 321 unchanged - 9 fixed = 325 total (was 330) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 1s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 22s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_111. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 31s{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_121. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 25s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_121. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}190m 8s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_111 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | |
[jira] [Updated] (YARN-5830) Avoid preempting AM containers
[ https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-5830: --- Attachment: YARN-5830.002.patch [~kasha], uploaded the new patch for all your comments. YARN-6038 is created and updated the TODO in this patch. > Avoid preempting AM containers > -- > > Key: YARN-5830 > URL: https://issues.apache.org/jira/browse/YARN-5830 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Karthik Kambatla >Assignee: Yufei Gu > Attachments: YARN-5830.001.patch, YARN-5830.002.patch > > > While considering containers for preemption, avoid AM containers unless > absolutely necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784103#comment-15784103 ] Wangda Tan commented on YARN-6029: -- And in addition, I suggest to downgrade severity to critical to unblock 2.8, since this only happens rarely. > CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by > Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to > release a reserved container > -- > > Key: YARN-6029 > URL: https://issues.apache.org/jira/browse/YARN-6029 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Blocker > Attachments: YARN-6029.001.patch, deadlock.jstack > > > When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls > YarnClient#getQueueAclsInfo) just at the moment that > LeafQueue#assignContainers is called and before notifying parent queue to > release resource (should release a reserved container), then ResourceManager > can deadlock. I found this problem on our testing environment for hadoop2.8. > Reproduce the deadlock in chronological order > * 1. Thread A (ResourceManager Event Processor) calls synchronized > LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a) > * 2. Thread B (IPC Server handler) calls synchronized > ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue > root), iterates over children queue acls and is blocked when calling > synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of > queue root.a is hold by Thread A) > * 3. 
Thread A wants to inform the parent queue that a container is being > completed and is blocked when invoking synchronized > ParentQueue#internalReleaseResource method (the ParentQueue instance lock of > queue root is hold by Thread B) > I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be > removed to solve this problem, since this method appears to not affect fields > of LeafQueue instance. > Attach patch with UT for review. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
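The inverted lock order in steps 1 through 3 can be reproduced with a minimal, self-contained sketch. Plain `ReentrantLock`s stand in for the `ParentQueue` (root) and `LeafQueue` (root.a) monitors; the class and method names are illustrative, not the actual CapacityScheduler code. `tryLock` with a timeout is used where `synchronized` would block forever, so the demo terminates and reports the deadlock instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative reproduction of the reported lock-order inversion; names are
// invented for the sketch and do not match the real YARN classes.
public class QueueDeadlockSketch {
    static final ReentrantLock parentQueueLock = new ReentrantLock(); // "root"
    static final ReentrantLock leafQueueLock = new ReentrantLock();   // "root.a"

    /** Forces the interleaving from the bug report; returns true when each
     *  thread ends up waiting on the lock the other one holds. */
    public static boolean demonstrateDeadlock() {
        final CountDownLatch bothFirstLocksHeld = new CountDownLatch(2);
        final boolean[] wouldBlock = {false, false};
        // Thread A: assignContainers holds the leaf lock, then needs the
        // parent lock to release a reserved container.
        Thread a = new Thread(() -> acquireInOrder(leafQueueLock, parentQueueLock,
                bothFirstLocksHeld, wouldBlock, 0));
        // Thread B: getQueueUserAclInfo holds the parent lock, then needs the
        // leaf lock to read the child queue's ACLs.
        Thread b = new Thread(() -> acquireInOrder(parentQueueLock, leafQueueLock,
                bothFirstLocksHeld, wouldBlock, 1));
        a.start(); b.start();
        try { a.join(); b.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wouldBlock[0] && wouldBlock[1];
    }

    private static void acquireInOrder(ReentrantLock first, ReentrantLock second,
            CountDownLatch latch, boolean[] wouldBlock, int idx) {
        first.lock();
        try {
            latch.countDown();
            latch.await();  // wait until both threads hold their first lock
            // With synchronized methods this acquisition would block forever;
            // the timeout lets the demo terminate and record the deadlock.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                wouldBlock[idx] = true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock reproduced: " + demonstrateDeadlock());
    }
}
```

The proposed fix, dropping `synchronized` from the read-only `LeafQueue#getQueueUserAclInfo`, breaks the cycle because Thread B then never needs the leaf lock while holding the parent lock.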
[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784099#comment-15784099 ] Wangda Tan commented on YARN-6029: -- bq. I'm not clear about this. Is it worth to ensure consistency of acls through reducing the efficiency of scheduler? It is going to be inefficient; previously getQueueInfo held the scheduler lock, and that caused problems. bq. We also noticed that it doesn't hold the lock of LeafQueue instance when updating acls (CapacityScheduler#setQueueAcls) so that current logic doesn't guarantee the consistency of acls. Yeah, you're correct... I think we could directly get queue ACL info from CS by invoking authorizer#checkPermissions, and we can have a separate lock to protect permission get/set. cc: [~jianhe] But this should be a separate patch, since we need to fix getQueueInfo as well. I think we can go ahead and fix the locks inside LQ#assignContainers. Thoughts?
[jira] [Commented] (YARN-5685) Non-embedded HA failover is broken
[ https://issues.apache.org/jira/browse/YARN-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784070#comment-15784070 ] Daniel Templeton commented on YARN-5685: I would look at it the other way around. The default assumption should be embedded leader election. If an out-of-process election is added, it will be the new and different thing, and it should then have a config param that enables it. Today with no alternative, having a config param for embedded makes no sense. The two options are: cluster works and cluster broken. If you're really convinced that leadership election is something that will change again soon, then I'd still deprecate the embedded option and add a new option that selects the election type: "embedded" or "external", with the "external" value only added once that's actually a thing. Heck, throw in "manual" as a valid value, and we can also deprecate the config param to enable automatic failover.
[jira] [Comment Edited] (YARN-5830) Avoid preempting AM containers
[ https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024 ] Yufei Gu edited comment on YARN-5830 at 12/28/16 11:58 PM: --- [~kasha], thanks for the review. The high-level approach: 1. On the first node, sort the containersToCheck list by putting all non-AM containers first. 2. Check the containers one by one. If we are lucky enough to find a solution without any AM container, just return that container list. If instead the solution we find includes an AM container, record it in {{potentialContainers}} and {{break}} to move on to the next node, because the next node may offer a solution without AM containers. 3. Repeat steps 1 and 2 for each node; the only difference is that once we already have a solution with an AM container, we can {{break}} out of the loop as soon as we hit the first AM container. 4. Return {{potentialContainers}} if no solution without AM containers can be found. With this approach I was trying to avoid sorting AM and non-AM containers across nodes, since that could be very inefficient when the number of nodes is huge. One concern is that we may have to check many nodes if many of them host AM containers, which may not be efficient; we could add a threshold to avoid that. was (Author: yufeigu): [~kasha], thanks for the review. The high-level approach: 1. In first nodes, sort the containersToCheck list by putting all non-AM containers first. 2. We check containers one by one. If we are lucky to found a solution without any AM container, just return the container list. But if we are not so lucky and found a solution with AM container, we record this solution in {{potentialContainers}} and {{break}} here to move to next node because we may find an solution without AM container in next node. 3. Repeat step 1 and 2 for each node, the only difference is if we've already got a solution with AM container, we can {{break}} the loop once we meet first AM container. 4. Return the {{potentialContainers}} if we cannot find any solution without AM containers. One of my concern is that we should check many nodes if there are many nodes with AM containers, which may be not efficient. We can put a threshold to avoid it.
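The four steps above can be sketched as a small self-contained search. This is an illustration of the described approach only, with an invented `Container` type and a flat memory-only resource model, not the actual FairScheduler preemption code or its API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the per-node preemption search described in the
// comment; class and method names are invented for the example.
public class PreemptionSketch {
    public static class Container {
        final String id; final int memory; final boolean isAM;
        Container(String id, int memory, boolean isAM) {
            this.id = id; this.memory = memory; this.isAM = isAM;
        }
    }

    /** Pick containers covering {@code needed} memory, preferring a node
     *  whose solution contains no AM container; the first AM-bearing
     *  solution is kept as a fallback ({@code potentialContainers}). */
    public static List<Container> pickContainers(List<List<Container>> nodes,
                                                 int needed) {
        List<Container> potentialContainers = null;
        for (List<Container> containersToCheck : nodes) {
            // Step 1: non-AM containers first on this node.
            List<Container> sorted = new ArrayList<>(containersToCheck);
            sorted.sort(Comparator.comparing((Container c) -> c.isAM));
            List<Container> chosen = new ArrayList<>();
            int acc = 0;
            boolean touchedAM = false;
            for (Container c : sorted) {
                // Step 3: once a fallback exists, give up on this node as
                // soon as we reach its AM containers.
                if (c.isAM && potentialContainers != null) { chosen = null; break; }
                touchedAM |= c.isAM;
                chosen.add(c);
                acc += c.memory;
                if (acc >= needed) break;
            }
            if (chosen == null || acc < needed) continue; // node can't help
            if (!touchedAM) return chosen;                // Step 2: AM-free wins
            if (potentialContainers == null) potentialContainers = chosen;
        }
        return potentialContainers;  // Step 4: may be null if nothing fits
    }

    /** Tiny demo: node 1 can only satisfy the request by taking an AM
     *  container, node 2 can satisfy it with ordinary containers. */
    public static List<String> demo() {
        List<Container> node1 = List.of(new Container("am1", 2048, true),
                                        new Container("c1", 1024, false));
        List<Container> node2 = List.of(new Container("c2", 1024, false),
                                        new Container("c3", 1024, false));
        List<String> ids = new ArrayList<>();
        for (Container c : pickContainers(List.of(node1, node2), 2048)) {
            ids.add(c.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note how the per-node sort keeps the cost local: containers are never sorted across nodes, which matches the stated goal of avoiding a global sort over a huge cluster.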
[jira] [Commented] (YARN-5556) Support for deleting queues without requiring a RM restart
[ https://issues.apache.org/jira/browse/YARN-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784035#comment-15784035 ] Naganarasimha Garla commented on YARN-5556: --- Thanks Xuan, will start working on it and update the patch at the earliest... > Support for deleting queues without requiring a RM restart > -- > > Key: YARN-5556 > URL: https://issues.apache.org/jira/browse/YARN-5556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Xuan Gong >Assignee: Naganarasimha G R > Attachments: YARN-5556.v1.001.patch, YARN-5556.v1.002.patch, > YARN-5556.v1.003.patch, YARN-5556.v1.004.patch > > > Today, we could add or modify queues without restarting the RM, via a CS > refresh. But for deleting queue, we have to restart the ResourceManager. We > could support for deleting queues without requiring a RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-5830) Avoid preempting AM containers
[ https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024 ] Yufei Gu edited comment on YARN-5830 at 12/28/16 11:53 PM: --- [~kasha], thanks for the review. The high-level approach: 1. In first nodes, sort the containersToCheck list by putting all non-AM containers first. 2. We check containers one by one. If we are lucky to found a solution without any AM container, just return the container list. But if we are not so lucky and found a solution with AM container, we record this solution in {{potentialContainers}} and {{break}} here to move to next node because we may find an solution without AM container in next node. 3. Repeat step 1 and 2 for each node, the only difference is if we've already got a solution with AM container, we can {{break}} the loop once we meet first AM container. 4. Return the {{potentialContainers}} if we cannot find any solution without AM containers. One of my concern is that we should check many nodes if there are many nodes with AM containers, which may be not efficient. We can put a threshold to avoid it. was (Author: yufeigu): [~kasha], thanks for the review. The high-level approach: 1. In first nodes, sort the containersToCheck list by putting all non-AM containers first. 2. We check containers one by one. If we are lucky to found a solution without any AM container, just return the container list. But not lucky and we find a solution with AM container, we record this solution in {{potentialContainers}} and {{break}} here to move to next node because we may find an solution without AM container in next node. 3. Repeat step 1 and 2 for each node, the only difference is if we've already got a solution with AM container, we can {{break}} the loop once we meet first AM container. 4. Return the {{potentialContainers}} if we cannot find any solution without AM containers. 
> Avoid preempting AM containers > -- > > Key: YARN-5830 > URL: https://issues.apache.org/jira/browse/YARN-5830 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Karthik Kambatla >Assignee: Yufei Gu > Attachments: YARN-5830.001.patch > > > While considering containers for preemption, avoid AM containers unless > absolutely necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-5709) Cleanup leader election configs and pluggability
[ https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-5709: --- Comment: was deleted (was: Also, looks to me like this is committed to branch-2, but not trunk.) > Cleanup leader election configs and pluggability > > > Key: YARN-5709 > URL: https://issues.apache.org/jira/browse/YARN-5709 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: yarn-5709-branch-2.8.01.patch, > yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, > yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, > yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch > > > While reviewing YARN-5677 and YARN-5694, I noticed we could make the > curator-based election code cleaner. It is nicer to get this fixed in 2.8 > before we ship it, but this can be done at a later time as well. > # By EmbeddedElector, we meant it was running as part of the RM daemon. Since > the Curator-based elector is also running embedded, I feel the code should be > checking for {{!curatorBased}} instead of {{isEmbeddedElector}} > # {{LeaderElectorService}} should probably be named > {{CuratorBasedEmbeddedElectorService}} or some such. > # The code that initializes the elector should be at the same place > irrespective of whether it is curator-based or not. > # We seem to be caching the CuratorFramework instance in RM. It makes more > sense for it to be in RMContext. If others are okay with it, we might even be > better of having {{RMContext#getCurator()}} method to lazily create the > curator framework and then cache it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5830) Avoid preempting AM containers
[ https://issues.apache.org/jira/browse/YARN-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784024#comment-15784024 ] Yufei Gu commented on YARN-5830: [~kasha], thanks for the review. The high-level approach: 1. On the first node, sort the containersToCheck list by putting all non-AM containers first. 2. Check the containers one by one. If we are lucky enough to find a solution without any AM container, just return that container list. If instead the solution we find includes an AM container, record it in {{potentialContainers}} and {{break}} to move on to the next node, because the next node may offer a solution without AM containers. 3. Repeat steps 1 and 2 for each node; the only difference is that once we already have a solution with an AM container, we can {{break}} out of the loop as soon as we hit the first AM container. 4. Return {{potentialContainers}} if no solution without AM containers can be found.
[jira] [Commented] (YARN-5556) Support for deleting queues without requiring a RM restart
[ https://issues.apache.org/jira/browse/YARN-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784018#comment-15784018 ] Xuan Gong commented on YARN-5556: - [~Naganarasimha] Given all the dependent patches have been committed, could you rebase the patch, please?
[jira] [Resolved] (YARN-5755) Enhancements to STOP queue handling
[ https://issues.apache.org/jira/browse/YARN-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong resolved YARN-5755. - Resolution: Duplicate Fix Version/s: 3.0.0-alpha2 2.9.0 > Enhancements to STOP queue handling > --- > > Key: YARN-5755 > URL: https://issues.apache.org/jira/browse/YARN-5755 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.9.0, 3.0.0-alpha2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5755) Enhancements to STOP queue handling
[ https://issues.apache.org/jira/browse/YARN-5755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784014#comment-15784014 ] Xuan Gong commented on YARN-5755: - Close this as duplicate. The issue has already been handled in YARN-5756
[jira] [Commented] (YARN-5987) NM configured command to collect heap dump of preempted container
[ https://issues.apache.org/jira/browse/YARN-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784003#comment-15784003 ] Daniel Templeton commented on YARN-5987: Thanks, [~miklos.szeg...@cloudera.com]. Is it possible to leverage any of the work on YARN-2261 for this JIRA? > NM configured command to collect heap dump of preempted container > - > > Key: YARN-5987 > URL: https://issues.apache.org/jira/browse/YARN-5987 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Miklos Szegedi >Assignee: Miklos Szegedi > Attachments: YARN-5987.000.patch, YARN-5987.001.patch > > > The node manager can kill a container, if it exceeds the assigned memory > limits. It would be nice to have a configuration entry to set up a command > that can collect additional debug information, if needed. The collected > information can be used for root cause analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783996#comment-15783996 ] Hudson commented on YARN-4882: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11051 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11051/]) YARN-4882. Change the log level to DEBUG for recovering completed (rkanter: rev f216276d2164c6564632c571fd3adbb03bc8b3e4) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java > Change the log level to DEBUG for recovering completed applications > --- > > Key: YARN-4882 > URL: https://issues.apache.org/jira/browse/YARN-4882 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Daniel Templeton > Labels: oct16-easy > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-4882.001.patch, YARN-4882.002.patch, > YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch > > > I think recovering completed applications need not be logged at INFO; it can > be logged at DEBUG instead. The problem seen on large clusters is that if any > issue happens during RM startup and the RM keeps switching, the RM logs are > filled mostly with application-recovery messages. > Six lines are logged per application, as shown in the logs below, and the RM > default for max-completed applications is 10K, so each switch adds 10K*6=60K > lines of little value. 
> {noformat} > 2016-03-01 10:20:59,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority > level is set to application:application_1456298208485_21507 > 2016-03-01 10:20:59,094 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering > app: application_1456298208485_21507 with 1 attempts and final state = > FINISHED > 2016-03-01 10:20:59,100 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Recovering attempt: appattempt_1456298208485_21507_01 with final state: > FINISHED > 2016-03-01 10:20:59,107 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1456298208485_21507_01 State change from NEW to FINISHED > 2016-03-01 10:20:59,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1456298208485_21507 State change from NEW to FINISHED > 2016-03-01 10:20:59,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith > OPERATION=Application Finished - Succeeded TARGET=RMAppManager > RESULT=SUCCESS APPID=application_1456298208485_21507 > {noformat} > The main problem is that important information from before the RM became > unstable is missing from the logs. Even with a rollback of 50 or 100 log files, > all of them are rolled over in a short period, and what remains contains only > RM switching information, mostly application recovery. > I suggest that at least completed-application recovery be logged at DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
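The committed change demotes per-application recovery messages for completed applications from INFO to DEBUG. A minimal sketch of the idea follows; it uses `java.util.logging` (where FINE plays the role of DEBUG) purely for self-containment, and `levelFor()` is a hypothetical helper, not the actual RMAppImpl code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: completed applications are recovered quietly at
// DEBUG (FINE here), while still-running applications keep INFO so operators
// continue to see them during RM failover.
public class RecoveryLogSketch {
    private static final Logger LOG =
        Logger.getLogger(RecoveryLogSketch.class.getName());

    static Level levelFor(boolean appCompleted) {
        return appCompleted ? Level.FINE : Level.INFO;
    }

    public static void main(String[] args) {
        // A completed app: logged only when DEBUG/FINE is enabled.
        LOG.log(levelFor(true),
            "Recovering app: application_1456298208485_21507 with 1 attempts");
        // A running app: still logged at INFO.
        LOG.log(levelFor(false),
            "Recovering app: application_1456298208485_21508 (running)");
    }
}
```

With 10K completed applications at 6 lines each, the ~60K lines per RM switch disappear from the INFO log while remaining available when DEBUG is enabled.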
[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783989#comment-15783989 ] Daniel Templeton commented on YARN-4882: Thanks, [~rkanter]! > Change the log level to DEBUG for recovering completed applications > --- > > Key: YARN-4882 > URL: https://issues.apache.org/jira/browse/YARN-4882 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Daniel Templeton > Labels: oct16-easy > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-4882.001.patch, YARN-4882.002.patch, > YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch > > > I think recovering completed applications need not be logged at INFO; it can > be logged at DEBUG instead. The problem seen on large clusters is that if any > issue happens during RM startup and the RM keeps switching, the RM logs are > filled mostly with application-recovery messages. > Six lines are logged per application, as shown in the logs below, and the RM > default for max-completed applications is 10K, so each switch adds 10K*6=60K > lines of little value. 
> {noformat} > 2016-03-01 10:20:59,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority > level is set to application:application_1456298208485_21507 > 2016-03-01 10:20:59,094 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering > app: application_1456298208485_21507 with 1 attempts and final state = > FINISHED > 2016-03-01 10:20:59,100 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Recovering attempt: appattempt_1456298208485_21507_01 with final state: > FINISHED > 2016-03-01 10:20:59,107 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1456298208485_21507_01 State change from NEW to FINISHED > 2016-03-01 10:20:59,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1456298208485_21507 State change from NEW to FINISHED > 2016-03-01 10:20:59,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith > OPERATION=Application Finished - Succeeded TARGET=RMAppManager > RESULT=SUCCESS APPID=application_1456298208485_21507 > {noformat} > The main problem is that important information from before the RM became > unstable is missing from the logs. Even with a rollback of 50 or 100 log files, > all of them are rolled over in a short period, and what remains contains only > RM switching information, mostly application recovery. > I suggest that at least completed-application recovery be logged at DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5258) Document Use of Docker with LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783980#comment-15783980 ] Hadoop QA commented on YARN-5258: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 56s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 16m 31s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:a9ad5d6 | | JIRA Issue | YARN-5258 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12844992/YARN-5258.004.patch | | Optional Tests | asflicense mvnsite xml | | uname | Linux f800184c7201 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 9ca54f4 | | modules | C: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: . | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/14487/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Document Use of Docker with LinuxContainerExecutor > -- > > Key: YARN-5258 > URL: https://issues.apache.org/jira/browse/YARN-5258 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Labels: oct16-easy > Attachments: YARN-5258.001.patch, YARN-5258.002.patch, > YARN-5258.003.patch, YARN-5258.004.patch > > > There aren't currently any docs that explain how to configure Docker and all > of its various options aside from reading all of the JIRAs. We need to > document the configuration, use, and troubleshooting, along with helpful > examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783961#comment-15783961 ] Robert Kanter commented on YARN-4882: - +1 > Change the log level to DEBUG for recovering completed applications > --- > > Key: YARN-4882 > URL: https://issues.apache.org/jira/browse/YARN-4882 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Daniel Templeton > Labels: oct16-easy > Attachments: YARN-4882.001.patch, YARN-4882.002.patch, > YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch > > > I think recovering completed applications need not be logged at INFO; it can > be logged at DEBUG instead. The problem seen on large clusters is that if any > issue happens during RM startup and the RM keeps switching, the RM logs are > filled mostly with application-recovery messages. > Six lines are logged per application, as shown in the logs below, and the RM > default for max-completed applications is 10K, so each switch adds 10K*6=60K > lines of little value. 
> {noformat} > 2016-03-01 10:20:59,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority > level is set to application:application_1456298208485_21507 > 2016-03-01 10:20:59,094 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering > app: application_1456298208485_21507 with 1 attempts and final state = > FINISHED > 2016-03-01 10:20:59,100 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Recovering attempt: appattempt_1456298208485_21507_01 with final state: > FINISHED > 2016-03-01 10:20:59,107 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1456298208485_21507_01 State change from NEW to FINISHED > 2016-03-01 10:20:59,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1456298208485_21507 State change from NEW to FINISHED > 2016-03-01 10:20:59,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith > OPERATION=Application Finished - Succeeded TARGET=RMAppManager > RESULT=SUCCESS APPID=application_1456298208485_21507 > {noformat} > The main problem is that important information from before the RM became > unstable is missing from the logs. Even with a rollback of 50 or 100 log files, > all of them are rolled over in a short period, and what remains contains only > RM switching information, mostly application recovery. > I suggest that at least completed-application recovery be logged at DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783954#comment-15783954 ] Hadoop QA commented on YARN-4882: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} the 
patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 252 unchanged - 2 fixed = 252 total (was 254) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 38m 59s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 59m 51s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:a9ad5d6 | | JIRA Issue | YARN-4882 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12844988/YARN-4882.005.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 271abc327851 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 9ca54f4 | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/14486/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/14486/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/14486/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This
[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identifying containers to preempt
[ https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-6038: --- Issue Type: Sub-task (was: Improvement) Parent: YARN-5990 > Check other resource requests if cannot match the first one while identifying > containers to preempt > --- > > Key: YARN-6038 > URL: https://issues.apache.org/jira/browse/YARN-6038 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Yufei Gu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identifying containers to preempt
[ https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-6038: --- Summary: Check other resource requests if cannot match the first one while identifying containers to preempt (was: Check other resource requests if cannot match the first one while identify containers to preempt) > Check other resource requests if cannot match the first one while identifying > containers to preempt > --- > > Key: YARN-6038 > URL: https://issues.apache.org/jira/browse/YARN-6038 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Yufei Gu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one while identify containers to preempt
[ https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-6038: --- Summary: Check other resource requests if cannot match the first one while identify containers to preempt (was: Check other resource requests if cannot match the first one) > Check other resource requests if cannot match the first one while identify > containers to preempt > > > Key: YARN-6038 > URL: https://issues.apache.org/jira/browse/YARN-6038 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Yufei Gu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5258) Document Use of Docker with LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-5258: --- Attachment: YARN-5258.004.patch Looks like I missed [~tangzhankun]'s comments before. This patch addresses those and a couple of other minor issues I just caught. [~sidharta-s] or [~vvasudev], any comments? > Document Use of Docker with LinuxContainerExecutor > -- > > Key: YARN-5258 > URL: https://issues.apache.org/jira/browse/YARN-5258 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Affects Versions: 2.8.0 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Labels: oct16-easy > Attachments: YARN-5258.001.patch, YARN-5258.002.patch, > YARN-5258.003.patch, YARN-5258.004.patch > > > There aren't currently any docs that explain how to configure Docker and all > of its various options aside from reading all of the JIRAs. We need to > document the configuration, use, and troubleshooting, along with helpful > examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6038) Check other resource requests if cannot match the first one
[ https://issues.apache.org/jira/browse/YARN-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu updated YARN-6038: --- Summary: Check other resource requests if cannot match the first one (was: Check other resource requests if we can't match the first one) > Check other resource requests if cannot match the first one > --- > > Key: YARN-6038 > URL: https://issues.apache.org/jira/browse/YARN-6038 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Reporter: Yufei Gu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5849) Automatically create YARN control group for pre-mounted cgroups
[ https://issues.apache.org/jira/browse/YARN-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783916#comment-15783916 ] Daniel Templeton commented on YARN-5849: Latest patch looks good to me. [~bibinchundatt], any additional comments? > Automatically create YARN control group for pre-mounted cgroups > --- > > Key: YARN-5849 > URL: https://issues.apache.org/jira/browse/YARN-5849 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.7.3, 3.0.0-alpha1, 3.0.0-alpha2 >Reporter: Miklos Szegedi >Assignee: Miklos Szegedi >Priority: Minor > Attachments: YARN-5849.000.patch, YARN-5849.001.patch, > YARN-5849.002.patch, YARN-5849.003.patch, YARN-5849.004.patch, > YARN-5849.005.patch, YARN-5849.006.patch, YARN-5849.007.patch, > YARN-5849.008.patch > > > YARN can be launched with linux-container-executor.cgroups.mount set to > false. It will then search for the cgroup mount paths set up by the administrator > by parsing the /etc/mtab file. You can also specify > resource.percentage-physical-cpu-limit to limit the CPU resources assigned to > containers. > linux-container-executor.cgroups.hierarchy is the root of the settings of all > YARN containers. If this is specified but the directory has not been created, YARN will fail at > startup: > Caused by: java.io.FileNotFoundException: > /cgroups/cpu/hadoop-yarn/cpu.cfs_period_us (Permission denied) > org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.updateCgroup(CgroupsLCEResourcesHandler.java:263) > This JIRA is about automatically creating the YARN control group in the case > above. It reduces the cost of administration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
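The behavior the patch adds — create the configured YARN hierarchy (e.g. {{hadoop-yarn}}) under an administrator-mounted controller instead of failing with the FileNotFoundException above — can be sketched minimally like this. The method and class names are hypothetical stand-ins, not the actual CgroupsLCEResourcesHandler change:

```java
import java.io.File;

// Hypothetical sketch: ensure the YARN cgroup hierarchy exists under a
// pre-mounted controller (e.g. /cgroups/cpu), creating it if missing so the
// NodeManager does not fail at startup.
public class CgroupInitSketch {
    static boolean ensureHierarchy(File controllerMount, String hierarchy) {
        File dir = new File(controllerMount, hierarchy);
        // Already present, or created now; false means the NM user lacks
        // permission and the administrator must pre-create the directory.
        return dir.isDirectory() || dir.mkdirs();
    }

    public static void main(String[] args) {
        // Demo against a temp directory standing in for a mounted controller.
        File mount = new File(System.getProperty("java.io.tmpdir"), "cpu");
        mount.mkdirs();
        System.out.println(ensureHierarchy(mount, "hadoop-yarn"));
    }
}
```

Note the exception in the description is a permission failure, so auto-creation only helps when the NodeManager user can write to the mounted controller; otherwise the hierarchy still has to be created and chowned by the administrator.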
[jira] [Commented] (YARN-5554) MoveApplicationAcrossQueues does not check user permission on the target queue
[ https://issues.apache.org/jira/browse/YARN-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783908#comment-15783908 ] Daniel Templeton commented on YARN-5554: Let's get this thing closed out. A few more comments: * In {{ClientRMService}}, {code}Server.getRemoteAddress(), null, targetQueue)||{code} should have a space before the pipes * In the new {{QueueACLsManager.checkAccess()}}, I'd really appreciate a comment that sums up the previous discussion on this JIRA so that the next person is less confused than I was * In {{TestClientRMService.getQueueAclManager()}}, the {{answer()}} method in the anonymous inner class should have an {{@Override}} annotation. Also, I think you'll run into problems with Java 7 and the non-final parameters being used inside the anonymous inner class * Same comments for {{createClientRMServiceForMoveApplicationRequest()}}, plus you shouldn't need the suppress warnings annotation now that YARN-4457 is in. You will need it in branch-2, though. > MoveApplicationAcrossQueues does not check user permission on the target queue > -- > > Key: YARN-5554 > URL: https://issues.apache.org/jira/browse/YARN-5554 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Haibo Chen >Assignee: Wilfred Spiegelenburg > Labels: oct16-medium > Attachments: YARN-5554.10.patch, YARN-5554.11.patch, > YARN-5554.2.patch, YARN-5554.3.patch, YARN-5554.4.patch, YARN-5554.5.patch, > YARN-5554.6.patch, YARN-5554.7.patch, YARN-5554.8.patch, YARN-5554.9.patch > > > moveApplicationAcrossQueues operation currently does not check user > permission on the target queue. This incorrectly allows one user to move > his/her own applications to a queue that the user has no access to -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
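Two of the review points above — adding {{@Override}} to the {{answer()}} method of an anonymous inner class, and Java 7's requirement that parameters captured by such a class be declared final — are illustrated below. A plain interface stands in for Mockito's Answer so the sketch is self-contained; all names are hypothetical, not the TestClientRMService code:

```java
public class AnswerSketch {
    // Stand-in for Mockito's Answer, to keep the sketch self-contained.
    interface Answer<T> { T answer(); }

    // Before Java 8, parameters used inside an anonymous inner class must be
    // final (Java 8+ relaxes this to "effectively final").
    static Answer<Boolean> accessCheck(final String user, final String targetQueue) {
        return new Answer<Boolean>() {
            @Override  // catches signature drift against the interface at compile time
            public Boolean answer() {
                return !user.isEmpty() && !targetQueue.isEmpty();
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(accessCheck("alice", "root.default").answer());
    }
}
```

Without the final modifiers this would fail to compile on Java 7, which is why the reviewer flags it for code that still targets branch-2.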
[jira] [Commented] (YARN-5554) MoveApplicationAcrossQueues does not check user permission on the target queue
[ https://issues.apache.org/jira/browse/YARN-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783850#comment-15783850 ] Daniel Templeton commented on YARN-5554: Yep, I noticed that as well. The {{remoteAddress}} and {{forwardedAddress}} parameters that are the reason for the special casing are actually ignored under the covers, resulting in all the schedulers doing the same thing anyway. > MoveApplicationAcrossQueues does not check user permission on the target queue > -- > > Key: YARN-5554 > URL: https://issues.apache.org/jira/browse/YARN-5554 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.2 >Reporter: Haibo Chen >Assignee: Wilfred Spiegelenburg > Labels: oct16-medium > Attachments: YARN-5554.10.patch, YARN-5554.11.patch, > YARN-5554.2.patch, YARN-5554.3.patch, YARN-5554.4.patch, YARN-5554.5.patch, > YARN-5554.6.patch, YARN-5554.7.patch, YARN-5554.8.patch, YARN-5554.9.patch > > > moveApplicationAcrossQueues operation currently does not check user > permission on the target queue. This incorrectly allows one user to move > his/her own applications to a queue that the user has no access to -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4882) Change the log level to DEBUG for recovering completed applications
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-4882: --- Attachment: YARN-4882.005.patch Changed the colon in the attempt message to an equals. > Change the log level to DEBUG for recovering completed applications > --- > > Key: YARN-4882 > URL: https://issues.apache.org/jira/browse/YARN-4882 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith Sharma K S >Assignee: Daniel Templeton > Labels: oct16-easy > Attachments: YARN-4882.001.patch, YARN-4882.002.patch, > YARN-4882.003.patch, YARN-4882.004.patch, YARN-4882.005.patch > > > I think recovering completed applications need not be logged at INFO; it can > be logged at DEBUG instead. The problem seen on large clusters is that if any > issue happens during RM startup and the RM keeps switching, the RM logs are > filled mostly with application-recovery messages. > Six lines are logged per application, as shown in the logs below, and the RM > default for max-completed applications is 10K, so each switch adds 10K*6=60K > lines of little value. 
> {noformat} > 2016-03-01 10:20:59,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority > level is set to application:application_1456298208485_21507 > 2016-03-01 10:20:59,094 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering > app: application_1456298208485_21507 with 1 attempts and final state = > FINISHED > 2016-03-01 10:20:59,100 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Recovering attempt: appattempt_1456298208485_21507_01 with final state: > FINISHED > 2016-03-01 10:20:59,107 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1456298208485_21507_01 State change from NEW to FINISHED > 2016-03-01 10:20:59,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1456298208485_21507 State change from NEW to FINISHED > 2016-03-01 10:20:59,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith > OPERATION=Application Finished - Succeeded TARGET=RMAppManager > RESULT=SUCCESS APPID=application_1456298208485_21507 > {noformat} > The main problem is that important information from before the RM became > unstable is missing from the logs. Even with a rollback of 50 or 100 log files, > all of them are rolled over in a short period, and what remains contains only > RM switching information, mostly application recovery. > I suggest that at least completed-application recovery be logged at DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5709) Cleanup leader election configs and pluggability
[ https://issues.apache.org/jira/browse/YARN-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated YARN-5709: --- Attachment: yarn-5709-branch-2.8.03.patch Forgot we were talking about branch-2.8, so the suppress warnings is still needed. In this patch I unjavadoced the comment. It passes test-patch for me. We'll see. > Cleanup leader election configs and pluggability > > > Key: YARN-5709 > URL: https://issues.apache.org/jira/browse/YARN-5709 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: yarn-5709-branch-2.8.01.patch, > yarn-5709-branch-2.8.02.patch, yarn-5709-branch-2.8.03.patch, > yarn-5709-branch-2.8.patch, yarn-5709-wip.2.patch, yarn-5709.1.patch, > yarn-5709.2.patch, yarn-5709.3.patch, yarn-5709.4.patch > > > While reviewing YARN-5677 and YARN-5694, I noticed we could make the > curator-based election code cleaner. It is nicer to get this fixed in 2.8 > before we ship it, but this can be done at a later time as well. > # By EmbeddedElector, we meant it was running as part of the RM daemon. Since > the Curator-based elector is also running embedded, I feel the code should be > checking for {{!curatorBased}} instead of {{isEmbeddedElector}} > # {{LeaderElectorService}} should probably be named > {{CuratorBasedEmbeddedElectorService}} or some such. > # The code that initializes the elector should be at the same place > irrespective of whether it is curator-based or not. > # We seem to be caching the CuratorFramework instance in RM. It makes more > sense for it to be in RMContext. If others are okay with it, we might even be > better of having {{RMContext#getCurator()}} method to lazily create the > curator framework and then cache it. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5275) Timeline application page cannot be loaded when no application submitted/running on the cluster after HADOOP-9613
[ https://issues.apache.org/jira/browse/YARN-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783754#comment-15783754 ] Daniel Templeton commented on YARN-5275: Ping, [~sunilg]... > Timeline application page cannot be loaded when no application > submitted/running on the cluster after HADOOP-9613 > - > > Key: YARN-5275 > URL: https://issues.apache.org/jira/browse/YARN-5275 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha1 >Reporter: Tsuyoshi Ozawa >Priority: Critical > > After HADOOP-9613, Timeline Web UI has a problem reported by [~leftnoteasy] > and [~sunilg] > {quote} > when no application submitted/running on the cluster, applications page > cannot be loaded. > {quote} > We should investigate the reason and fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783757#comment-15783757 ] Daniel Templeton commented on YARN-2962: No worries. I'm happy to help with the review. > ZKRMStateStore: Limit the number of znodes under a znode > > > Key: YARN-2962 > URL: https://issues.apache.org/jira/browse/YARN-2962 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Varun Saxena >Priority: Critical > Attachments: YARN-2962.01.patch, YARN-2962.04.patch, > YARN-2962.05.patch, YARN-2962.2.patch, YARN-2962.3.patch > > > We ran into this issue where we were hitting the default ZK server message > size configs, primarily because the message had too many znodes even though > individually they were all small.
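One way to keep the number of children under any single znode bounded, in the spirit of what this JIRA discusses, is to bucket application znodes under intermediate nodes keyed by a prefix of the application id. The split depth and path layout in this sketch are assumptions for illustration only, not ZKRMStateStore's actual scheme:

```java
// Illustrative sketch: cap children per znode by inserting an intermediate
// bucket node derived from the app id's sequence number. The prefix length
// and path layout are assumptions, not the real ZKRMStateStore layout.
public class ZnodeBucketSketch {

    /** Map an app id to a two-level path: root/<bucket>/<appId>. */
    static String bucketedPath(String root, String appId, int prefixLen) {
        // The sequence number is the part after the last underscore.
        String seq = appId.substring(appId.lastIndexOf('_') + 1);
        String bucket = seq.substring(0, Math.min(prefixLen, seq.length()));
        return root + "/" + bucket + "/" + appId;
    }

    public static void main(String[] args) {
        // Apps sharing the first 4 sequence digits land under the same bucket.
        System.out.println(
            bucketedPath("/rmstore/apps", "application_1483000000000_012345", 4));
        // prints /rmstore/apps/0123/application_1483000000000_012345
    }
}
```

With a 4-digit bucket, no bucket node ever holds more than 100 direct children per 6-digit sequence range, which keeps each `getChildren` response well under ZooKeeper's default message-size limit.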
[jira] [Resolved] (YARN-4401) A failed app recovery should not prevent the RM from starting
[ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton resolved YARN-4401. Resolution: Won't Fix This JIRA is superseded by YARN-6035, YARN-6036, and YARN-6037, which capture the same idea but more supportably. > A failed app recovery should not prevent the RM from starting > - > > Key: YARN-4401 > URL: https://issues.apache.org/jira/browse/YARN-4401 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: Daniel Templeton >Assignee: Daniel Templeton >Priority: Critical > Attachments: YARN-4401.001.patch > > > There are many different reasons why an app recovery could fail with an > exception, causing the RM start to be aborted. If that happens the RM will > fail to start. Presumably, the reason the RM is trying to do a recovery is > that it's the standby trying to fill in for the active. Failing to come up > defeats the purpose of the HA configuration. Instead of preventing the RM > from starting, a failed app recovery should log an error and skip the > application.
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783546#comment-15783546 ] Daniel Templeton edited comment on YARN-6031 at 12/28/16 7:22 PM: -- bq. IIUC ignore validation on recovery also should work. Then you end up with unschedulable apps in the system, which can't be good. bq. We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. The issue there is that the labels may have some important meaning to the job, so defaulting the labels may be bad. As applications can have side-effects, I think it's better to have the failure up front than let the application potentially fail somewhere down the line. The admin is then immediately made aware that he screwed up by disabling a feature that was still in use. was (Author: templedf): bq. IIUC ignore validation on recovery also should work. Then you end up with unschedulable apps in the system, which can't be good. bg. We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. The issue there is that the labels may have some important meaning to the job, so defaulting the labels may be bad. As applications can have side-effects, I think it's better to have the failure up front than let the application potentially fail somewhere down the line. The admin is then immediately made aware that he screwed up by disabling a feature that was still in use. 
> Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here is the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 10 more > {noformat} > During RM restart, application recovery failed due to that application had > node label expression specified while node label has been disabled. 
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783546#comment-15783546 ] Daniel Templeton commented on YARN-6031: bq. IIUC ignore validation on recovery also should work. Then you end up with unschedulable apps in the system, which can't be good. bg. We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. The issue there is that the labels may have some important meaning to the job, so defaulting the labels may be bad. As applications can have side-effects, I think it's better to have the failure up front than let the application potentially fail somewhere down the line. The admin is then immediately made aware that he screwed up by disabling a feature that was still in use. > Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here is the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339) > at > 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 10 more > {noformat} > During RM restart, application recovery failed due to that application had > node label expression specified while node label has been disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5831) Propagate allowPreemptionFrom flag all the way down to the app
[ https://issues.apache.org/jira/browse/YARN-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783486#comment-15783486 ] Karthik Kambatla commented on YARN-5831: Thanks for working on this, Yufei. Comments on the patch: # FSAppAttempt: ## Should we have a method called isPreemptable() in Schedulable and override it? Also, we could may be add unit tests for isPreemptable() separately then? ## canContainerBePreempted(): Nothing to do with this patch, but I wonder if we should check isPreemptable() first? # updatePreemptionVariables refactor: I like the idea of not recursing for updating preemption variables. ## We should make sure that queue.init() is called in a pre-order fashion on config update. Right now, we rely on the iterator (queues.values()). Is that guaranteed to show items in preorder? ## How do we guard against future changes where one updates a parent preemption config outside of config changes and that does not propagate to children? How likely are such changes? Do we even need to worry about them now? ## Nothing to do with this patch, but we should likely rename FSQueue#init to FSQueue#reinit? > Propagate allowPreemptionFrom flag all the way down to the app > -- > > Key: YARN-5831 > URL: https://issues.apache.org/jira/browse/YARN-5831 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Karthik Kambatla >Assignee: Yufei Gu > Attachments: YARN-5831.001.patch, YARN-5831.002.patch > > > FairScheduler allows disallowing preemption from a queue. When checking if > preemption for an application is allowed, the new preemption code recurses > all the way to the root queue to check this flag. > Propagating this information all the way to the app will be more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
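Karthik's first review point (an {{isPreemptable()}} the app can consult without recursing to root) together with his pre-order concern can be sketched roughly as follows; the {{Queue}} class and method names below are illustrative stand-ins for the real FSQueue/Schedulable types, not the actual patch:

```java
// Sketch of propagating allowPreemptionFrom down the queue tree so that
// canContainerBePreempted() reads a precomputed flag instead of walking to
// root. Class/method names are illustrative, not FairScheduler's actual API.
import java.util.ArrayList;
import java.util.List;

public class PreemptableSketch {

    static class Queue {
        final Queue parent;
        final List<Queue> children = new ArrayList<>();
        boolean allowPreemptionFrom = true; // per-queue configuration
        boolean preemptable = true;         // effective, propagated value

        Queue(Queue parent) {
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        // Recomputes the effective flag top-down. Recursing from the parent
        // guarantees the pre-order visit the review asks about: a child only
        // reads parent.preemptable after the parent has been initialized.
        void init() {
            preemptable = allowPreemptionFrom
                && (parent == null || parent.preemptable);
            for (Queue child : children) {
                child.init();
            }
        }
    }

    public static void main(String[] args) {
        Queue root = new Queue(null);
        Queue a = new Queue(root);
        Queue a1 = new Queue(a);
        a.allowPreemptionFrom = false; // disallow preemption from subtree "a"
        root.init();
        System.out.println("a1 preemptable: " + a1.preemptable); // false: inherited from a
    }
}
```

Driving the recursion from the parent's `init()` (rather than iterating a flat queue collection) sidesteps the question of whether the map iterator yields queues in pre-order.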
[jira] [Created] (YARN-6038) Check other resource requests if we can't match the first one
Yufei Gu created YARN-6038: -- Summary: Check other resource requests if we can't match the first one Key: YARN-6038 URL: https://issues.apache.org/jira/browse/YARN-6038 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Yufei Gu
[jira] [Created] (YARN-6037) Add an option to yarn resourcemanager CLI to list all applications that would cause a recovery failure
Daniel Templeton created YARN-6037: -- Summary: Add an option to yarn resourcemanager CLI to list all applications that would cause a recovery failure Key: YARN-6037 URL: https://issues.apache.org/jira/browse/YARN-6037 Project: Hadoop YARN Issue Type: Improvement Reporter: Daniel Templeton Today the RM will fail and exit on the first application recovery failure. It is often desirable to know the full list of applications that would fail recovery. This JIRA proposes to add a CLI option to have the RM complete recovery regardless of failures and then exit after printing the full list of applications that could not be recovered along with reasons.
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397 ] Bibin A Chundatt edited comment on YARN-6031 at 12/28/16 6:31 PM: -- As [~sunilg] mentioned earlier ignoring application could create stale application in state store. [~Ying Zhang] IIUC ignore validation on recovery also should work. {code} private static void validateResourceRequest(ResourceRequest resReq, Resource maximumResource, QueueInfo queueInfo, RMContext rmContext) throws InvalidResourceRequestException { Configuration conf = rmContext.getYarnConfiguration(); // If Node label is not enabled throw exception if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) { String labelExp = resReq.getNodeLabelExpression(); if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp) || null == labelExp)) { throw new InvalidLabelResourceRequestException( "Invalid resource request, node label not enabled " + "but request contains label expression"); } } {code} Thoughts?? {quote} The current fact is (with or without this fix): application submitted with node label expression explicitly specified will fail during recovery {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. Solution 2: We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. Havn't looked at impact of the same. An elaborate testing would be needed to see how metrics are impacted. Disadvantage is client will never get to know that reset happened in RM side YARN-4562 will try to handle ignore loading label configuration when disabled. [~templedf] i do agree that admin would require some way to get application info when recovery fails so that bulk update in state store is possible. was (Author: bibinchundatt): As [~sunilg] mentioned earlier ignoring application could create stale application in state store. 
[~Ying Zhang] IIUC ignore validation on recovery also should work. {code} private static void validateResourceRequest(ResourceRequest resReq, Resource maximumResource, QueueInfo queueInfo, RMContext rmContext) throws InvalidResourceRequestException { Configuration conf = rmContext.getYarnConfiguration(); // If Node label is not enabled throw exception if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) { String labelExp = resReq.getNodeLabelExpression(); if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp) || null == labelExp)) { throw new InvalidLabelResourceRequestException( "Invalid resource request, node label not enabled " + "but request contains label expression"); } } {code} Thoughts?? {quote} The current fact is (with or without this fix): application submitted with node label expression explicitly specified will fail during recovery {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. Solution 2: We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. Havn't looked at impact of the same. An elaborate testing would be needed to see how metrics are impacted. Disadvantage is client will never get to know that reset happened in RM side YARN-4562 will try to handle ignore loading label configuration when disabled. [~templedf] i do agree that admin would require some way to dump application info when recovery fails so that bulk update in state store is possible. 
> Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here is the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at >
[jira] [Created] (YARN-6036) Add -show-application-info option to yarn resourcemanager CLI
Daniel Templeton created YARN-6036: -- Summary: Add -show-application-info option to yarn resourcemanager CLI Key: YARN-6036 URL: https://issues.apache.org/jira/browse/YARN-6036 Project: Hadoop YARN Issue Type: Improvement Components: client Reporter: Daniel Templeton If an application has failed, the admin can purge it with {{-remove-application-from-state-store}}, but she has no information at that time about what the offending job is. This JIRA proposes to add a {{-show-application-info}} switch to the resourcemanager CLI that will print the application information without needing a running RM so that the admin can make an informed decision.
[jira] [Created] (YARN-6035) Add -force-recovery option to yarn resourcemanager
Daniel Templeton created YARN-6035: -- Summary: Add -force-recovery option to yarn resourcemanager Key: YARN-6035 URL: https://issues.apache.org/jira/browse/YARN-6035 Project: Hadoop YARN Issue Type: Improvement Components: client Reporter: Daniel Templeton If multiple applications cannot be recovered, the admin is forced to repeatedly attempt to start the RM, check the logs, and purge the offending app. Worse, the admin has no information about the app that was purged, other than the ID. This JIRA proposes to add a {{-force-recovery}} option to the resourcemanager CLI that will automatically purge any apps that fail to recover after dumping the full application profile to a log. It would be nice if the option prompts the user with an "are you sure?" before continuing.
[jira] [Assigned] (YARN-5824) Verify app starvation under custom preemption thresholds and timeouts
[ https://issues.apache.org/jira/browse/YARN-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yufei Gu reassigned YARN-5824: -- Assignee: Yufei Gu > Verify app starvation under custom preemption thresholds and timeouts > - > > Key: YARN-5824 > URL: https://issues.apache.org/jira/browse/YARN-5824 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Reporter: Karthik Kambatla >Assignee: Yufei Gu > > YARN-5783 adds basic tests to verify applications are identified to be > starved. This JIRA is to add more advanced tests for different values of > preemption thresholds and timeouts.
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397 ] Bibin A Chundatt edited comment on YARN-6031 at 12/28/16 6:08 PM: -- As [~sunilg] mentioned earlier ignoring application could create stale application in state store. [~Ying Zhang] IIUC ignore validation on recovery also should work. {code} private static void validateResourceRequest(ResourceRequest resReq, Resource maximumResource, QueueInfo queueInfo, RMContext rmContext) throws InvalidResourceRequestException { Configuration conf = rmContext.getYarnConfiguration(); // If Node label is not enabled throw exception if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) { String labelExp = resReq.getNodeLabelExpression(); if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp) || null == labelExp)) { throw new InvalidLabelResourceRequestException( "Invalid resource request, node label not enabled " + "but request contains label expression"); } } {code} Thoughts?? {quote} The current fact is (with or without this fix): application submitted with node label expression explicitly specified will fail during recovery {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. Solution 2: We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. Havn't looked at impact of the same. An elaborate testing would be needed to see how metrics are impacted. Disadvantage is client will never get to know that reset happened in RM side YARN-4562 will try to handle ignore loading label configuration when disabled. [~templedf] i do agree that admin would require some way to dump application info when recovery fails so that bulk update in state store is possible. was (Author: bibinchundatt): As [~sunilg] mentioned earlier ignoring application could create stale application in state store. 
[~Ying Zhang] IIUC ignore validation on recovery also should work. {code} private static void validateResourceRequest(ResourceRequest resReq, Resource maximumResource, QueueInfo queueInfo, RMContext rmContext) throws InvalidResourceRequestException { Configuration conf = rmContext.getYarnConfiguration(); // If Node label is not enabled throw exception if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) { String labelExp = resReq.getNodeLabelExpression(); if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp) || null == labelExp)) { throw new InvalidLabelResourceRequestException( "Invalid resource request, node label not enabled " + "but request contains label expression"); } } {code} Thoughts?? {quote} The current fact is (with or without this fix): application submitted with node label expression explicitly specified will fail during recovery {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. Solution 2: We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. Havn't looked at impact of the same. An elaborate testing would be needed to see how metrics are impacted. YARN-4562 will try to handle ignore loading label configuration when disabled. [~templedf] i do agree that admin would require some way to dump application info when recovery fails so that bulk update in state store is possible. 
> Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here is the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394) > at >
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783397#comment-15783397 ] Bibin A Chundatt commented on YARN-6031: As [~sunilg] mentioned earlier ignoring application could create stale application in state store. [~Ying Zhang] IIUC ignore validation on recovery also should work. {code} private static void validateResourceRequest(ResourceRequest resReq, Resource maximumResource, QueueInfo queueInfo, RMContext rmContext) throws InvalidResourceRequestException { Configuration conf = rmContext.getYarnConfiguration(); // If Node label is not enabled throw exception if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) { String labelExp = resReq.getNodeLabelExpression(); if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp) || null == labelExp)) { throw new InvalidLabelResourceRequestException( "Invalid resource request, node label not enabled " + "but request contains label expression"); } } {code} Thoughts?? {quote} The current fact is (with or without this fix): application submitted with node label expression explicitly specified will fail during recovery {quote} IMHO should be acceptable since any application submitted with labels when feature is disabled gets rejected. Solution 2: We could ignore/reset labels to default in resourcerequest when nodelabels are disabled. Havn't looked at impact of the same. An elaborate testing would be needed to see how metrics are impacted. YARN-4562 will try to handle ignore loading label configuration when disabled. [~templedf] i do agree that admin would require some way to dump application info when recovery fails so that bulk update in state store is possible. 
> Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here is the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 10 more > {noformat} > During RM restart, application recovery failed due to that application had > node label expression specified while node label has been disabled. 
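The recovery-time escape hatch Bibin sketches for the validation quoted above could look roughly like the following. This is a minimal sketch, not the actual patch: the {{isRecovery}} flag and the stripped-down class names are assumptions, standing in for the real SchedulerUtils/RMNodeLabelsManager types.

```java
// Hypothetical sketch: skip the node-label check during recovery so that
// previously submitted apps do not abort RM start when labels are disabled.
// The isRecovery flag and class names are illustrative assumptions.
public class LabelValidationSketch {

    static final String NO_LABEL = ""; // stand-in for RMNodeLabelsManager.NO_LABEL

    /** Stand-in for InvalidLabelResourceRequestException. */
    static class InvalidLabelRequestException extends Exception {
        InvalidLabelRequestException(String msg) { super(msg); }
    }

    /**
     * Mirrors the validateResourceRequest logic quoted in the comment, with
     * an extra isRecovery flag: on recovery the check is skipped instead of
     * failing the whole recovery.
     */
    static void validateLabelExpression(String labelExp,
                                        boolean nodeLabelsEnabled,
                                        boolean isRecovery)
            throws InvalidLabelRequestException {
        if (nodeLabelsEnabled || isRecovery) {
            return; // labels allowed, or recovering a pre-existing app
        }
        if (labelExp != null && !NO_LABEL.equals(labelExp)) {
            throw new InvalidLabelRequestException(
                "Invalid resource request, node label not enabled "
                + "but request contains label expression");
        }
    }

    public static void main(String[] args) throws Exception {
        // New submission with a label while labels are disabled: rejected.
        boolean rejected = false;
        try {
            validateLabelExpression("gpu", false, false);
        } catch (InvalidLabelRequestException e) {
            rejected = true;
        }
        System.out.println("new submission rejected: " + rejected);

        // Same request during recovery: allowed through without an exception.
        validateLabelExpression("gpu", false, true);
        System.out.println("recovery allowed: true");
    }
}
```

As the thread notes, letting such an app through on recovery leaves it unschedulable, so this sketch only makes sense paired with a mechanism that then kills or surfaces the app with a clear reason.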
[jira] [Commented] (YARN-5969) FairShareComparator: Cache value of getResourceUsage for better performance
[ https://issues.apache.org/jira/browse/YARN-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783392#comment-15783392 ] Yufei Gu commented on YARN-5969: Absolutely! [~zsl2007], thanks for working on this. Any contribution to the community is welcome! > FairShareComparator: Cache value of getResourceUsage for better performance > --- > > Key: YARN-5969 > URL: https://issues.apache.org/jira/browse/YARN-5969 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.7.1 >Reporter: zhangshilong >Assignee: zhangshilong > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: 20161206.patch, 20161222.patch, YARN-5969.patch, > apprunning_after.png, apprunning_before.png, > containerAllocatedDelta_before.png, containerAllocated_after.png, > pending_after.png, pending_before.png > > > In the FairShareComparator class, the performance of the getResourceUsage() > function is very poor. It can be executed more than 100,000,000 times per > second. In our scenario, it takes 20 seconds per minute. > A simple solution is to reduce the number of calls to the function.
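The caching idea behind this JIRA can be illustrated in miniature: snapshot each schedulable's usage once before sorting, so the comparator reads a cached value instead of re-aggregating on every comparison. The class and method names below are assumptions for illustration, not the actual FairScheduler patch.

```java
// Illustrative sketch of YARN-5969's caching idea: take one usage snapshot
// per Schedulable before the sort, so the O(n log n) comparator invocations
// read a cached field. Names are illustrative, not the actual patch.
import java.util.Arrays;

public class CachedUsageSketch {

    static class App {
        final String id;
        private final long usage;  // expensive to aggregate in the real scheduler
        private long cachedUsage;  // snapshot consulted while sorting

        App(String id, long usage) { this.id = id; this.usage = usage; }

        long computeUsage() { return usage; }  // stands in for getResourceUsage()
        void snapshotUsage() { cachedUsage = computeUsage(); }
        long getCachedUsage() { return cachedUsage; }
    }

    public static void main(String[] args) {
        App[] apps = {new App("a", 30), new App("b", 10), new App("c", 20)};
        for (App app : apps) {
            app.snapshotUsage(); // one aggregation pass before the sort
        }
        // Comparator touches only the cached value: n snapshots instead of
        // O(n log n) aggregations.
        Arrays.sort(apps,
            (x, y) -> Long.compare(x.getCachedUsage(), y.getCachedUsage()));
        StringBuilder order = new StringBuilder();
        for (App app : apps) {
            order.append(app.id);
        }
        System.out.println(order); // prints "bca"
    }
}
```

The trade-off is that usage may change between snapshot and comparison, which is acceptable here because fair-share ordering only needs to be approximately fresh within one scheduling pass.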
[jira] [Commented] (YARN-5257) Fix unreleased resources and null dereferences
[ https://issues.apache.org/jira/browse/YARN-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783376#comment-15783376 ] Yufei Gu commented on YARN-5257: Thanks [~rkanter] for the review and commit. > Fix unreleased resources and null dereferences > -- > > Key: YARN-5257 > URL: https://issues.apache.org/jira/browse/YARN-5257 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Yufei Gu >Assignee: Yufei Gu > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: YARN-5257.001.patch > > > The following code contain potential problems: > {code} > Unreleased Resource: Streams TopCLI.java:738 > Unreleased Resource: Streams Graph.java:189 > Unreleased Resource: Streams CgroupsLCEResourcesHandler.java:291 > Unreleased Resource: Streams UnmanagedAMLauncher.java:195 > Unreleased Resource: Streams CGroupsHandlerImpl.java:319 > Unreleased Resource: Streams TrafficController.java:629 > Null Dereference ApplicationImpl.java:465 > Null Dereference VisualizeStateMachine.java:52 > Null Dereference ContainerImpl.java:1089 > Null Dereference QueueManager.java:219 > Null Dereference QueueManager.java:232 > Null Dereference ResourceLocalizationService.java:1016 > Null Dereference ResourceLocalizationService.java:1023 > Null Dereference ResourceLocalizationService.java:1040 > Null Dereference ResourceLocalizationService.java:1052 > Null Dereference ProcfsBasedProcessTree.java:802 > Null Dereference TimelineClientImpl.java:639 > Null Dereference LocalizedResource.java:206 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5798) Handle FSPreemptionThread crashing due to a RuntimeException
[ https://issues.apache.org/jira/browse/YARN-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783375#comment-15783375 ] Karthik Kambatla commented on YARN-5798: Oh, and in RMFatalEventType, instead of calling it OTHERS, we should likely have a more descriptive name. > Handle FSPreemptionThread crashing due to a RuntimeException > > > Key: YARN-5798 > URL: https://issues.apache.org/jira/browse/YARN-5798 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Affects Versions: 2.9.0 >Reporter: Karthik Kambatla >Assignee: Yufei Gu > Attachments: YARN-5798.001.patch > > > YARN-5605 added an FSPreemptionThread. The run() method catches > InterruptedException. If this were to run into a RuntimeException, the > preemption thread would crash. We should probably fail the RM itself (or > transition to standby) when this happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5906) Update AppSchedulingInfo to use SchedulingPlacementSet
[ https://issues.apache.org/jira/browse/YARN-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783373#comment-15783373 ] Hudson commented on YARN-5906: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11050 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11050/]) YARN-5906. Update AppSchedulingInfo to use SchedulingPlacementSet. (sunilg: rev 9ca54f4810de182195263bd594afb56dab564105) * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/LocalitySchedulingPlacementSet.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/SchedulingPlacementSet.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimitsByPartition.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java > Update AppSchedulingInfo to use SchedulingPlacementSet > -- > > Key: YARN-5906 > URL: https://issues.apache.org/jira/browse/YARN-5906 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 3.0.0-alpha2 > > Attachments: YARN-5906.1.patch, YARN-5906.2.patch, YARN-5906.3.patch, > YARN-5906.4.patch, YARN-5906.5.patch > > > Currently AppSchedulingInfo simply stores resource request and scheduler make > decision according to stored resource request. For example, CS/FS use > slightly different approach to get pending resource request and make delay > scheduling decision. 
> There are several benefits to moving the pending resource request data structure > into SchedulingPlacementSet: > 1) Delay scheduling logic should be agnostic to the scheduler; for example, CS > supports count-based delay while FS supports both count-based and time-based > delay. Ideally a scheduler should be able to choose which delay scheduling > policy to use. > 2) In addition to 1), YARN-4902 proposes supporting pluggable delay > scheduling behavior beyond the locality-based one (host->rack->offswitch), > which requires more flexibility. > 3) To make YARN-4902 real, instead of directly adding the new > resource request API to the client, we can have the scheduler use it internally to > make sure it is well defined. AppSchedulingInfo/SchedulingPlacementSet > will be the perfect place to isolate which ResourceRequest implementation to > use. > 4) Different scheduling requirements need different behavior when checking the > ResourceRequest table. > This JIRA is the first of several refactorings; it moves all > ResourceRequest data structures and logic into SchedulingPlacementSet. Follow-up > changes are needed to structure it better: > - Make delay scheduling a plugin of SchedulingPlacementSet. > - After YARN-4902 is committed, change SchedulingPlacementSet to use > YARN-4902 internally.
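The refactoring direction described above — pending requests owned by a placement set, with delay scheduling as a pluggable policy — can be sketched abstractly as follows. This is a hypothetical simplification for illustration only; the real SchedulingPlacementSet interface in the patch is richer, and every name here other than SchedulingPlacementSet itself is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the idea: the placement set owns the
// pending request table, and the delay policy is a plugin rather than
// scheduler-specific logic.
public class PlacementSetSketch {

    interface DelayPolicy {
        // Decide whether to keep waiting for better locality.
        boolean shouldWait(int missedOpportunities);
    }

    // Count-based delay, as a CS-style scheduler might configure it.
    static class CountBasedDelay implements DelayPolicy {
        private final int maxMisses;
        CountBasedDelay(int maxMisses) { this.maxMisses = maxMisses; }
        public boolean shouldWait(int missed) { return missed < maxMisses; }
    }

    static class SchedulingPlacementSet {
        private final Map<String, Integer> pending = new HashMap<>();
        private final DelayPolicy delay;
        SchedulingPlacementSet(DelayPolicy delay) { this.delay = delay; }

        void addPending(String resourceName, int containers) {
            pending.merge(resourceName, containers, Integer::sum);
        }

        // Off-switch allocation is allowed only once the delay policy
        // stops asking us to wait for locality.
        boolean canAllocateOffSwitch(int missedOpportunities) {
            return !delay.shouldWait(missedOpportunities);
        }

        int pendingOn(String resourceName) {
            return pending.getOrDefault(resourceName, 0);
        }
    }

    public static void main(String[] args) {
        SchedulingPlacementSet ps =
            new SchedulingPlacementSet(new CountBasedDelay(3));
        ps.addPending("host1", 2);
        System.out.println("pending on host1: " + ps.pendingOn("host1"));
        System.out.println("off-switch after 1 miss: " + ps.canAllocateOffSwitch(1));
        System.out.println("off-switch after 3 misses: " + ps.canAllocateOffSwitch(3));
    }
}
```

Under this shape, a time-based policy (as FS supports) would be just another DelayPolicy implementation, which is the point of item 1) above.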
[jira] [Commented] (YARN-5798) Handle FSPreemptionThread crashing due to a RuntimeException
[ https://issues.apache.org/jira/browse/YARN-5798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783370#comment-15783370 ] Karthik Kambatla commented on YARN-5798: Thanks for picking this up, Yufei. Comments on the patch: # Interrupting the thread is the only way to stop it, so I think we should just return (and not transition to standby) on InterruptedException. # For RTEs, how about using a custom {{Thread.UncaughtExceptionHandler}}? Other threads in FairScheduler do not handle RTEs either; we could use the same handler for all of them. It might not be a bad idea to increase the scope of this JIRA to include those as well. # We should also add tests.
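A shared handler along the lines suggested above could look roughly like this. It is only an illustrative sketch, not the actual YARN patch: the class names and the standby-transition stub are assumptions, with only {{Thread.UncaughtExceptionHandler}} coming from the JDK.

```java
// Hypothetical sketch: one uncaught-exception handler shared by the
// FairScheduler daemon threads, so an unexpected RuntimeException is
// surfaced instead of silently killing the thread.
public class CriticalThreadHandlerSketch {

    static class RMCriticalThreadHandler implements Thread.UncaughtExceptionHandler {
        @Override
        public void uncaughtException(Thread t, Throwable e) {
            // A real RM would log this and transition to standby;
            // the sketch just reports the failure.
            System.out.println("Critical thread " + t.getName()
                + " crashed: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread preemption = new Thread(
            () -> { throw new RuntimeException("simulated preemption failure"); },
            "FSPreemptionThread");
        // Install the shared handler before starting the thread.
        preemption.setUncaughtExceptionHandler(new RMCriticalThreadHandler());
        preemption.start();
        preemption.join();
        System.out.println("handler ran; RM would now fail over");
    }
}
```

The same handler instance could be set on every critical scheduler thread, which is what widening the JIRA's scope would amount to.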
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783232#comment-15783232 ] Daniel Templeton commented on YARN-6031: I agree that {{-force-recovery}} could cause a significant information loss, but it's something that the admin has to do explicitly, and it's only application information, so it's not the end of the world. With a {{-dump-application-information}} option, the admin has the choice to either 1) look at each app that fails the recovery and decide whether to purge it or do something else (like turn node labels back on), or 2) do a bulk purge with {{-force-recovery}}. It might also be good to have another option, something like {{-dry-run-recovery}}, that would tell the admin the IDs of all the applications that will fail during recovery so that she doesn't have to keep doing them one at a time. In fact, I could even see making that the default behavior before failing the resource manager. In any case, I don't think the approach proposed in this JIRA, to just ignore the failed app, is going to work out. 
> Application recovery failed after disabling node label > -- > > Key: YARN-6031 > URL: https://issues.apache.org/jira/browse/YARN-6031 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Ying Zhang >Assignee: Ying Zhang >Priority: Minor > Attachments: YARN-6031.001.patch > > > Here are the repro steps: > Enable node label, restart RM, configure CS properly, and run some jobs; > Disable node label, restart RM, and the following exception is thrown: > {noformat} > Caused by: > org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: > Invalid resource request, node label not enabled but request contains label > expression > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 10 more > {noformat} > During RM restart, application recovery failed because the application had a > node label expression specified while node labels have been disabled.
[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783133#comment-15783133 ] Daniel Templeton commented on YARN-5931: I agree, but it should be "collection", not "collections". > Document timeout interfaces CLI and REST APIs > - > > Key: YARN-5931 > URL: https://issues.apache.org/jira/browse/YARN-5931 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > Attachments: ResourceManagerRest.html, YARN-5931.0.patch, > YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html > >
[jira] [Commented] (YARN-5906) Update AppSchedulingInfo to use SchedulingPlacementSet
[ https://issues.apache.org/jira/browse/YARN-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783106#comment-15783106 ] Sunil G commented on YARN-5906: --- Test case failures are unrelated. Will commit in a short while.
[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783083#comment-15783083 ] Rohith Sharma K S commented on YARN-5931: - I think this sentence is more meaningful: "When you run a GET operation on this resource, a collections of ApplicationTimeout object is returned"
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783073#comment-15783073 ] Sunil G commented on YARN-6031: --- Yes, makes sense. This is more or less work for the admin, then. I am not sure whether the RM can take the call and remove the app internally; it may be costly, since we are deleting a user's record. But if the necessary information is pushed to the logs, then we may be fine removing the data internally. Thoughts?
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783025#comment-15783025 ] Daniel Templeton commented on YARN-6031: bq. max_applications may hit and valid apps may get emitted at some point of time. Exactly my concern. So far, the answer has been that the recovery should just fail, and the admin should clear the app. See your comments in YARN-4401. Personally, I think that's a bad experience for the admin, but resolving it will require a bit of infrastructure work to make something useful happen. I think a {{-force-recovery}} option that purges the bad apps after dumping full info to the logs would be a good start. It would also help to have an option to dump the info about an app from the state store without having the RM running, so that when using {{-remove-application-from-state-store}} you know what you're purging.
[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783002#comment-15783002 ] Daniel Templeton commented on YARN-5931: Thanks for the update. Two more things: * This one is still outstanding: "When you run a GET operation on this resource, you can obtain a collection of Application Timeout Objects." should be "When you run a GET operation on this resource, a collection of Application Timeout Objects is returned." * In the Cluster Applications API, you say that a collection "are" returned. It should be "is."
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782995#comment-15782995 ] Sunil G commented on YARN-6031: --- Yes [~templedf], you are correct. We will end up having many flaky apps in the state store. I had an offline chat with [~bibinchundatt] as well, and there may be a potential problem with that too: max_applications may hit and valid apps may get emitted at some point of time. We can forcefully remove the app from the state store; however, we may lose information about it, as it is a failed app. With clear logging, we can evict such apps from the state store. I am not sure whether we need to delete immediately or can hand off to an async monitor to delete later. Thoughts?
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782984#comment-15782984 ] Daniel Templeton commented on YARN-6031: Yep, tests are needed. Love the long explanatory comment. Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels? Also, what happens to the app in the state store? If we fail to recover it and just ignore it, it will sit there forever, I suspect. It's probably a bad thing if the RM and state store don't agree on what apps are active.
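The handling being debated — catch the validation failure during recovery, log enough detail for the admin, and track the failed app instead of failing the whole RM — can be sketched as follows. This is a simplified illustration, not the actual RMAppManager code; everything except the exception's name and message text is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of recovery-time handling: skip (and record) apps
// whose stored requests carry a label expression while labels are disabled.
public class RecoverySketch {

    // Stand-in for org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException.
    static class InvalidLabelResourceRequestException extends RuntimeException {
        InvalidLabelResourceRequestException(String m) { super(m); }
    }

    static void validate(String labelExpression, boolean labelsEnabled) {
        if (!labelsEnabled && labelExpression != null) {
            throw new InvalidLabelResourceRequestException(
                "Invalid resource request, node label not enabled but "
                + "request contains label expression " + labelExpression);
        }
    }

    public static void main(String[] args) {
        boolean nodeLabelsEnabled = false;   // labels were disabled before restart
        List<String> failed = new ArrayList<>();
        // app id -> stored label expression (null = NO_LABEL)
        String[][] apps = { {"app_1", null}, {"app_2", "gpu"} };
        for (String[] app : apps) {
            try {
                validate(app[1], nodeLabelsEnabled);
                System.out.println("recovered " + app[0]);
            } catch (InvalidLabelResourceRequestException e) {
                // Log loudly so the admin knows the cause (labels disabled)
                // and can decide whether to purge the app or re-enable labels.
                System.out.println("skipped " + app[0] + ": " + e.getMessage());
                failed.add(app[0]);
            }
        }
        System.out.println("failed apps: " + failed);
    }
}
```

The open question in the thread is what to do with the entries in the `failed` list: leave them in the state store, purge them immediately, or hand them to an async monitor for later deletion.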
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782814#comment-15782814 ] Sunil G edited comment on YARN-6031 at 12/28/16 1:04 PM: - Thanks [~Ying Zhang]. Overall the approach makes sense to me. You are basically checking for the label-disabled condition only when an exception is thrown. It's more or less the same as the earlier suggestion, so that's okay. A few more points, however. With this patch, app recovery will now continue. - The app's AM resource request may not have had a specific node label, but other containers may have. Since we send InvalidResourceRequest in those cases, it might be fine for now. - In the above case, what will happen to ongoing containers (w.r.t. the RM's data structures)? If it's a running app without any outstanding requests, we might need to treat the running containers as NO_LABEL. I am not sure whether this happens as of today; I will also check. These could be out of scope for this ticket, but let's get opinions from other folks as well. *Note*: Please add more test cases to cover this patch.
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782814#comment-15782814 ] Sunil G commented on YARN-6031: --- Thanks [~Ying Zhang]. Overall the approach makes sense to me. You are basically checking for the label-disabled condition only when an exception is thrown. It's more or less the same as the earlier suggestion, so that's okay. A few more points, however. With this patch, app recovery will now continue. - The app's AM resource request may not have had a specific node label, but other containers may have. Since we send InvalidResourceRequest in those cases, it might be fine for now. - In the above case, what will happen to ongoing containers? If it's a running app without any outstanding requests, we might need to change the label for the running containers. These could be out of scope for this ticket, but let's get opinions from other folks as well. NB: Please add more test cases to cover this patch.
[jira] [Commented] (YARN-6024) Capacity Scheduler 'continuous reservation looking' doesn't work when sum of queue's used and reserved resources is equal to max
[ https://issues.apache.org/jira/browse/YARN-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782613#comment-15782613 ] Sunil G commented on YARN-6024: --- Committed {{YARN-6024.001.patch}} to trunk/branch-2/branch-2.8 and {{YARN-6024-branch-2.7.001.patch}} to branch-2.7. Thanks [~leftnoteasy] for the patch and thanks [~Ying Zhang] for additional review. > Capacity Scheduler 'continuous reservation looking' doesn't work when sum of > queue's used and reserved resources is equal to max > > > Key: YARN-6024 > URL: https://issues.apache.org/jira/browse/YARN-6024 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6024-branch-2.7.001.patch, > YARN-6024-branch-2.7.001.patch, YARN-6024.001.patch > > > Found one corner case when continuous reservation looking doesn't work: > When queue's used=max, the queue's capacity check fails.
[jira] [Commented] (YARN-5719) Enforce a C standard for native container-executor
[ https://issues.apache.org/jira/browse/YARN-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782602#comment-15782602 ] Hudson commented on YARN-5719: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11049 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11049/]) YARN-5719. Enforce a C standard for native container-executor. (vvasudev: rev 972da46cb48725ad49d3e0a033742bd1a8228f51) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/CMakeLists.txt > Enforce a C standard for native container-executor > -- > > Key: YARN-5719 > URL: https://issues.apache.org/jira/browse/YARN-5719 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Reporter: Chris Douglas >Assignee: Chris Douglas > Fix For: 3.0.0-alpha2 > > Attachments: YARN-5719.000.patch > > > The {{container-executor}} build should declare the C standard it uses.
[jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782566#comment-15782566 ] Ying Zhang edited comment on YARN-6031 at 12/28/16 10:22 AM: - Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the InvalidResourceRequestException when recovering, fail the recovery only for that application and print an error message, then let the rest of the recovery continue. [~sunilg], your suggestion also makes sense to me. Actually, the code change using your approach would be made at the same place as in this patch, with a small modification: in recover(), inside the for loop, if the conditions are met, skip calling "recoverApplication" and log a message like "skip recovering application ..." instead. The difference is that with that approach we always check for these conditions even though failing is not the normal case, while with the approach in the patch we only react when the exception happens. I'm ok with either approach since the overhead is not that big. Let's see what others think :-) [~leftnoteasy], [~bibinchundatt] Just to clarify: the current behavior (with or without this fix) is that an application submitted with a node label expression explicitly specified will fail during recovery, while an application submitted without a node label expression will succeed, no matter whether there is a default node label expression for the target queue. This is because the call to "checkQueueLabelInLabelManager", which checks whether the node label exists in the node label manager (the node label manager has no labels at all when node labels are disabled), is skipped during recovery, as the following code snippet shows: {code:title=SchedulerUtils.java|borderStyle=solid} public static void normalizeAndValidateRequest(ResourceRequest resReq, Resource maximumResource, String queueName, YarnScheduler scheduler, boolean isRecovery, RMContext rmContext, QueueInfo queueInfo) throws InvalidResourceRequestException { ... ... SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo); if (!isRecovery) { validateResourceRequest(resReq, maximumResource, queueInfo, rmContext); // calls checkQueueLabelInLabelManager } {code} This is not exactly what happens when submitting a job in the normal case (i.e., not during recovery): in the normal case, when a default node label expression is defined for the queue and node labels are disabled, the application will also get rejected due to an invalid resource request, even if it doesn't specify a node label expression. I believe this will be fixed once YARN-4652 is addressed.
[jira] [Commented] (YARN-6029) CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to release a reserved
[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782578#comment-15782578 ] Tao Yang commented on YARN-6029: Thanks [~wangda] & [~Naganarasimha] ! {quote} Agree but IIUC based on 2.8 code its less dependent on locking of child queue as acls are updated during reinitialization all the queues at one shot {quote} We also noticed that it doesn't hold the lock of the LeafQueue instance when updating acls (CapacityScheduler#setQueueAcls), so the current logic doesn't guarantee the consistency of acls. {quote} So to ensure acls are returned appropriately i presume we should be holding the lock on CS.getQueueUserAclInfo which is not happening currently in 2.8. {quote} I'm not clear about this. Is it worth ensuring the consistency of acls at the cost of scheduler efficiency? > CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by > Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to > release a reserved container > -- > > Key: YARN-6029 > URL: https://issues.apache.org/jira/browse/YARN-6029 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Blocker > Attachments: YARN-6029.001.patch, deadlock.jstack > > > When ParentQueue#getQueueUserAclInfo is called (e.g. a client calls > YarnClient#getQueueAclsInfo) just at the moment that > LeafQueue#assignContainers is called and before notifying parent queue to > release resource (should release a reserved container), then ResourceManager > can deadlock. I found this problem on our testing environment for hadoop2.8. > Reproduce the deadlock in chronological order > * 1. Thread A (ResourceManager Event Processor) calls synchronized > LeafQueue#assignContainers (got LeafQueue instance lock of queue root.a) > * 2. 
Thread B (IPC Server handler) calls synchronized > ParentQueue#getQueueUserAclInfo (got ParentQueue instance lock of queue > root), iterates over children queue acls and is blocked when calling > synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance lock of > queue root.a is hold by Thread A) > * 3. Thread A wants to inform the parent queue that a container is being > completed and is blocked when invoking synchronized > ParentQueue#internalReleaseResource method (the ParentQueue instance lock of > queue root is hold by Thread B) > I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be > removed to solve this problem, since this method appears to not affect fields > of LeafQueue instance. > Attach patch with UT for review.
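The lock cycle described in steps 1-3, and why dropping the leaf-level lock from the read-only acl method breaks it, can be modeled with a toy two-level queue. The class shapes below are hypothetical stand-ins, not the actual CapacityScheduler code; the fix mirrors the patch's idea of removing synchronized from LeafQueue#getQueueUserAclInfo.

```java
import java.util.concurrent.*;

public class DeadlockSketch {
    static class ParentQueue {
        final LeafQueue leaf = new LeafQueue(this);
        // Thread A ends up here after taking the leaf lock.
        synchronized void internalReleaseResource() { /* parent bookkeeping */ }
        // Thread B holds the parent lock while iterating children acls.
        synchronized String getQueueUserAclInfo() {
            return leaf.getQueueUserAclInfo();
        }
    }
    static class LeafQueue {
        final ParentQueue parent;
        LeafQueue(ParentQueue p) { parent = p; }
        synchronized void assignContainers() {
            // Completing a reserved container notifies the parent -> parent lock.
            parent.internalReleaseResource();
        }
        // The fix: no 'synchronized' here. The method only reads acl state,
        // so no leaf lock is taken and the lock cycle cannot form.
        String getQueueUserAclInfo() { return "acls"; }
    }

    public static void main(String[] args) throws Exception {
        ParentQueue root = new ParentQueue();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<?> a = pool.submit(root.leaf::assignContainers);    // Thread A
        Future<String> b = pool.submit(root::getQueueUserAclInfo); // Thread B
        a.get(5, TimeUnit.SECONDS); // with the fix both threads finish
        System.out.println(b.get(5, TimeUnit.SECONDS));
        pool.shutdown();
    }
}
```

If getQueueUserAclInfo were synchronized as in the original 2.8 code, thread A (leaf lock, then parent lock) and thread B (parent lock, then leaf lock) could each hold one lock while waiting for the other, which is exactly the cycle shown in the attached jstack.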
[jira] [Updated] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying Zhang updated YARN-6031: - Attachment: YARN-6031.001.patch
[jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782566#comment-15782566 ] Ying Zhang commented on YARN-6031: -- Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the InvalidResourceRequestException when recovering, fail the recovery only for that application and print an error message, then let the rest of the recovery continue. [~sunilg], your suggestion also makes sense to me. Actually, the code change using your approach would be made at the same place as in this patch, with a small modification: in recover(), inside the for loop, if the conditions are met, skip calling "recoverApplication" and log a message like "skip recovering application ..." instead. The difference is that with that approach we always check for these conditions even though failing is not the normal case, while with the approach in the patch we only react when the exception happens. I'm ok with either approach since the overhead is not that big. Let's see what others think :-) [~leftnoteasy], [~bibinchundatt] Just to clarify: the current behavior (with or without this fix) is that an application submitted with a node label expression explicitly specified will fail during recovery, while an application submitted without a node label expression will succeed, no matter whether there is a default node label expression for the target queue. This is because the call to "checkQueueLabelInLabelManager", which checks whether the node label exists in the node label manager (the node label manager has no labels at all when node labels are disabled), is skipped during recovery, as the following code snippet shows: {code:title=SchedulerUtils.java|borderStyle=solid} public static void normalizeAndValidateRequest(ResourceRequest resReq, Resource maximumResource, String queueName, YarnScheduler scheduler, boolean isRecovery, RMContext rmContext, QueueInfo queueInfo) throws InvalidResourceRequestException { ... 
... SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo); if (!isRecovery) { validateResourceRequest(resReq, maximumResource, queueInfo, rmContext); // calling checkQueueLabelInLabelManager } {code}
[jira] [Created] (YARN-6034) Add support better logging in container-executor
Varun Vasudev created YARN-6034: --- Summary: Add support better logging in container-executor Key: YARN-6034 URL: https://issues.apache.org/jira/browse/YARN-6034 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Currently, the container-executor doesn't have any support for log levels, etc. Add support for better logging and setting log levels by the invoker or config file.
[jira] [Created] (YARN-6033) Add support for sections in container-executor configuration file
Varun Vasudev created YARN-6033: --- Summary: Add support for sections in container-executor configuration file Key: YARN-6033 URL: https://issues.apache.org/jira/browse/YARN-6033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev
[jira] [Updated] (YARN-5719) Enforce a C standard for native container-executor
[ https://issues.apache.org/jira/browse/YARN-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-5719: Assignee: Chris Douglas
[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782520#comment-15782520 ] Hadoop QA commented on YARN-5931: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 52s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} 
| {color:green} javadoc {color} | {color:green} 1m 41s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 40s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 48s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 206 unchanged - 0 fixed = 207 total (was 206) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 3 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s{color} | {color:red} The patch 12 line(s) with tabs. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 34s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 36s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 39m 45s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 11s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
[jira] [Commented] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782376#comment-15782376 ] Sunil G commented on YARN-5931: --- Thanks [~rohithsharma] Looks fine. I will wait for [~templedf] also. > Document timeout interfaces CLI and REST APIs > - > > Key: YARN-5931 > URL: https://issues.apache.org/jira/browse/YARN-5931 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Reporter: Rohith Sharma K S >Assignee: Rohith Sharma K S > Attachments: ResourceManagerRest.html, YARN-5931.0.patch, > YARN-5931.1.patch, YARN-5931.2.patch, YARN-5931.3.patch, YarnCommands.html > >
[jira] [Updated] (YARN-5931) Document timeout interfaces CLI and REST APIs
[ https://issues.apache.org/jira/browse/YARN-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-5931: Attachment: YARN-5931.3.patch Updated patch fixing review comments