[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile
[ https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204467#comment-17204467 ] Tao Yang commented on YARN-8737: Hi, [~Amithsha], [~wangda], [~bteke]. Sorry for missing this issue for so long. I haven't dug into this issue or checked whether the exception still occurs (I just searched for the key words "Comparison method violates its general contract" in the RM logs of our YARN clusters, which are only retained for 7 days, and nothing was returned), since this exception can't crash or otherwise affect the scheduling process in our internal versions. After looking into YARN-10178, I think this problem may have multiple causes; the common point is that some resources, such as the capacity resource or used resource in child queues (leaf or parent queues), changed while the parent queue was sorting them. I think this patch can solve the problem for the configuration-update scenario: adding a read lock in ParentQueue#sortAndGetChildrenAllocationIterator prevents the child queues' configured capacity from being updated while they are being sorted (a rough sketch of this idea follows this message). [~wangda], [~bteke], I would very much appreciate it if you could help review and commit this patch. We should also fix the problem for the scheduling scenario in YARN-10178. > Race condition in ParentQueue when reinitializing and sorting child queues in > the meanwhile > --- > > Key: YARN-8737 > URL: https://issues.apache.org/jira/browse/YARN-8737 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8737.001.patch > > > An administrator raised a queue update through the REST API; in the RM, the parent queue > was refreshing its child queues by calling ParentQueue#reinitialize while, at the same time, > async-scheduling threads were sorting the child queues in > ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may happen > and throw the following exception, because TimSort does not handle concurrent > modification of the objects it is sorting: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! 
> at java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeCollapse(TimSort.java:441) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1454) > at java.util.Collections.sort(Collections.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962) > {noformat} > I think we can add read-lock for > ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, the > write-lock will be hold when updating child queues in > ParentQueue#reinitialize. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
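The comment on YARN-8737 above proposes taking the queue's read lock while sorting. As a rough, hypothetical sketch (simplified class and field names, not the actual YARN-8737 patch), the idea is that the scheduling path sorts a copy of the children under the read lock while ParentQueue#reinitialize updates them under the write lock, so TimSort's comparator always sees a stable view of each child's capacity:
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ParentQueueSketch {
  static class ChildQueueSketch {
    final String name;
    volatile float configuredCapacity;
    ChildQueueSketch(String name, float capacity) {
      this.name = name;
      this.configuredCapacity = capacity;
    }
  }

  private final List<ChildQueueSketch> childQueues = new ArrayList<>();
  private final ReentrantReadWriteLock.ReadLock readLock;
  private final ReentrantReadWriteLock.WriteLock writeLock;

  public ParentQueueSketch() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    readLock = lock.readLock();
    writeLock = lock.writeLock();
  }

  // Scheduling path: sort a copy of the children while holding the read lock,
  // so a concurrent reinitialize cannot change capacities in the middle of the sort.
  public Iterator<ChildQueueSketch> sortAndGetChildrenAllocationIterator() {
    readLock.lock();
    try {
      List<ChildQueueSketch> sorted = new ArrayList<>(childQueues);
      sorted.sort(Comparator.comparingDouble(q -> q.configuredCapacity));
      return sorted.iterator();
    } finally {
      readLock.unlock();
    }
  }

  // Reconfiguration path: updates happen under the write lock and therefore
  // cannot interleave with an in-flight sort on a scheduling thread.
  public void reinitialize(List<ChildQueueSketch> newChildren) {
    writeLock.lock();
    try {
      childQueues.clear();
      childQueues.addAll(newChildren);
    } finally {
      writeLock.unlock();
    }
  }
}
{code}
Since read locks are shared, multiple async-scheduling threads can still sort concurrently; only the reconfiguration path is exclusive.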
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151440#comment-17151440 ] Tao Yang commented on YARN-10319: - Thanks [~prabhujoseph] for updating the patch. The latest patch LGTM. [~adam.antal], could you please help to review again? Thanks. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149285#comment-17149285 ] Tao Yang commented on YARN-10319: - Thanks [~adam.antal] for the review and comments. [~prabhujoseph], could you please consider these suggestions as well? Most changes in the latest patch LGTM; a minor suggestion is to change the root element name of BulkActivitiesInfo from "schedulerActivities" to "bulkActivities", and related places like ActivitiesTestUtils#FN_SCHEDULER_BULK_ACT_ROOT should be updated as well. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149030#comment-17149030 ] Tao Yang commented on YARN-10319: - Thanks for updating the patch and sorry for missing the last comment, I will take a look at the latest patch later today. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143416#comment-17143416 ] Tao Yang commented on YARN-10319: - Thanks [~prabhujoseph] for this improvement. I agree that it may be helpful for the single-node lookup mechanism, with which users can get all-nodes activities across multiple scheduling cycles at once for better debugging. Some comments about the patch: * Would it be better to rename "bulkactivities" (the REST API name) to "bulk-activities"? * SchedulerActivitiesInfo is similar to ActivitiesInfo, which also means scheduler activities info; can we rename it to BulkActivitiesInfo? * For consistency, we could also rename RMWebServices#getLastNActivities to RMWebServices#getBulkActivities. * ActivitiesManager#recordCount can be affected by both the activities and the bulk-activities REST APIs; we can use `recordCount.compareAndSet(0, 1)` instead of `recordCount.set(1)` to avoid getting an unexpected number of bulk activities, right? (See the sketch after this message.) * The fetching approaches of the activities and bulk-activities REST APIs are different (asynchronous vs. synchronous); I think we should elaborate on this in the documentation. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
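The recordCount point in the bullets above can be illustrated with a small hypothetical sketch (only the field name recordCount comes from the comment; the surrounding methods are invented for illustration and are not the real ActivitiesManager code). A plain set(1) from the single-activities endpoint could shrink an in-flight bulk recording of N scheduling cycles down to 1, while compareAndSet(0, 1) only starts a recording when none is active:
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class RecordCountSketch {
  // Number of scheduling cycles still to be recorded; 0 means "not recording".
  private final AtomicInteger recordCount = new AtomicInteger(0);

  // Bulk-activities REST API: record the next n scheduling cycles,
  // but only if no recording is currently in progress.
  public void triggerBulkRecording(int n) {
    recordCount.compareAndSet(0, n);
  }

  // Activities REST API: record a single cycle. Using set(1) here could
  // overwrite a larger in-flight bulk count; compareAndSet(0, 1) leaves it alone.
  public void triggerSingleRecording() {
    recordCount.compareAndSet(0, 1);
  }

  // Scheduler side: called once per scheduling cycle; returns whether this
  // cycle should be recorded and decrements the remaining count if so.
  public boolean shouldRecordThisCycle() {
    return recordCount.getAndUpdate(c -> c > 0 ? c - 1 : 0) > 0;
  }
}
{code}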
[jira] [Commented] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk
[ https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133859#comment-17133859 ] Tao Yang commented on YARN-8011: Thanks [~Jim_Brennan] for the feedback and contribution. The patch for branch-2.10 LGTM, already committed to branch-2.10. Thanks. > TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart > fails sometimes in trunk > --- > > Key: YARN-8011 > URL: https://issues.apache.org/jira/browse/YARN-8011 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Fix For: 3.1.0 > > Attachments: YARN-8011-branch-2.10.001.patch, YARN-8011.001.patch, > YARN-8011.002.patch > > > TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart > often pass, but the following errors sometimes occur: > {noformat} > java.lang.AssertionError: > Expected :15360 > Actual :14336 > > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732) > at > org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} > > This problem is caused by that deducting resource is a little behind the > assertion. To solve this problem, It can sleep a while before this assertion > as below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
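The YARN-8011 description above notes that the resource deduction can land slightly after the metrics assertion and suggests sleeping before the assertion. As a hedged alternative sketch (not the committed patch), the same race can be avoided by polling with GenericTestUtils#waitFor until the expected value appears; getAllocatedMB() below is a hypothetical placeholder for however the test reads the asserted metric:
{code}
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.test.GenericTestUtils;
import org.junit.Assert;

public class WaitForMetricsSketch {
  // Placeholder metric lookup; the real test reads the scheduler/queue metrics.
  static long getAllocatedMB() {
    return 15360;
  }

  static void assertAllocatedMBEventually(long expected)
      throws TimeoutException, InterruptedException {
    // Poll every 100 ms, give up after 10 s, instead of a fixed Thread.sleep().
    GenericTestUtils.waitFor(() -> getAllocatedMB() == expected, 100, 10_000);
    Assert.assertEquals(expected, getAllocatedMB());
  }
}
{code}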
[jira] [Updated] (YARN-8011) TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart fails sometimes in trunk
[ https://issues.apache.org/jira/browse/YARN-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8011: --- Fix Version/s: 2.10.1 > TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart > fails sometimes in trunk > --- > > Key: YARN-8011 > URL: https://issues.apache.org/jira/browse/YARN-8011 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Fix For: 3.1.0, 2.10.1 > > Attachments: YARN-8011-branch-2.10.001.patch, YARN-8011.001.patch, > YARN-8011.002.patch > > > TestOpportunisticContainerAllocatorAMService#testContainerPromoteAndDemoteBeforeContainerStart > often pass, but the following errors sometimes occur: > {noformat} > java.lang.AssertionError: > Expected :15360 > Actual :14336 > > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.verifyMetrics(TestOpportunisticContainerAllocatorAMService.java:732) > at > org.apache.hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService.testContainerPromoteAndDemoteBeforeContainerStart(TestOpportunisticContainerAllocatorAMService.java:330) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > {noformat} > > This problem is caused by that deducting resource is a little behind the > assertion. To solve this problem, It can sleep a while before this assertion > as below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133848#comment-17133848 ] Tao Yang commented on YARN-10293: - I think this patch is fine enough, and would like to commit the latest patch if there is no objection in a few hours. Thanks [~prabhujoseph] for this contribution. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. 
node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > 2020-05-21 12:13:33,243 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: >
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129091#comment-17129091 ] Tao Yang commented on YARN-10293: - Thanks [~prabhujoseph] for updating the patch. LGTM now, [~wangda], do you have some comments or suggestions about the patch? > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. 
node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > 2020-05-21 12:13:33,243 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Allocation proposal accepted > {code}
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127867#comment-17127867 ] Tao Yang commented on YARN-10293: - Thanks [~prabhujoseph] for updating the patch. Another concern in UT is that could you finish the UT without updating the controlling access for SchedulerNode#addUnallocatedResource? I think directly calling SchedulerNode#addUnallocatedResource in UT is hard to understand. BTW, please fix the remaining check-style warning, UT failures seem unrelated to this patch. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch, YARN-10293-004.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. 
> After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: >
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126407#comment-17126407 ] Tao Yang commented on YARN-10293: - Thanks [~prabhujoseph] for this effort. I'm fine with it, please go ahead. {quote} Yes sure, YARN-9598 addresses many other issues. Will check how to contribute to the same and address any other optimization required. {quote} Good to hear that, thanks. For the patch, overall it looks good; some suggestions about the UT: * In TestCapacitySchedulerMultiNodes#testExcessReservationWillBeUnreserved, this patch changes the behavior of the second-to-last allocation and makes the last allocation unnecessary; can you remove lines 261 to 267 to make it clearer? {code} Assert.assertEquals(1, schedulerApp1.getLiveContainers().size()); Assert.assertEquals(0, schedulerApp1.getReservedContainers().size()); -Assert.assertEquals(1, schedulerApp2.getLiveContainers().size()); - -// Trigger scheduling to allocate a container on nm1 for app2. -cs.handle(new NodeUpdateSchedulerEvent(rmNode1)); -Assert.assertNull(cs.getNode(nm1.getNodeId()).getReservedContainer()); -Assert.assertEquals(1, schedulerApp1.getLiveContainers().size()); -Assert.assertEquals(0, schedulerApp1.getReservedContainers().size()); Assert.assertEquals(2, schedulerApp2.getLiveContainers().size()); Assert.assertEquals(7 * GB, cs.getNode(nm1.getNodeId()).getAllocatedResource().getMemorySize()); Assert.assertEquals(12 * GB, cs.getRootQueue().getQueueResourceUsage().getUsed().getMemorySize()); {code} * Can we remove the TestCapacitySchedulerMultiNodesWithPreemption#getFiCaSchedulerApp method and get the scheduler app by calling CapacityScheduler#getApplicationAttempt? * There are lots of while loops, Thread#sleep calls and async-thread creation for checking states in TestCapacitySchedulerMultiNodesWithPreemption#testAllocationOfReservationFromOtherNode; could you please call GenericTestUtils#waitFor, MockRM#waitForState etc. to simplify it? (See the sketch after this message.) > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. 
JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when
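As a rough illustration of the second and third review suggestions in the comment above (not the actual test code), the custom getFiCaSchedulerApp helper and the while/Thread.sleep state checks could be replaced by looking the app up through the scheduler and polling with GenericTestUtils#waitFor; the method below and its timeouts are hypothetical:
{code}
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.test.GenericTestUtils;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp;

public class WaitForReservationSketch {
  static void waitForReservedContainer(CapacityScheduler cs,
      ApplicationAttemptId attemptId)
      throws TimeoutException, InterruptedException {
    // Replaces a custom getFiCaSchedulerApp() helper.
    FiCaSchedulerApp app = cs.getApplicationAttempt(attemptId);
    // Replaces hand-rolled "while (...) { Thread.sleep(...); }" loops.
    GenericTestUtils.waitFor(() -> !app.getReservedContainers().isEmpty(),
        100 /* poll interval ms */, 30_000 /* timeout ms */);
  }
}
{code}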
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124527#comment-17124527 ] Tao Yang commented on YARN-10293: - Thanks [~wangda] for your confirmation. I think the proposed change can solve the problem for heartbeat-driven scheduling but not async scheduling, since it may still keep in a loop that chooses the first one of candidate nodes then do re-reservation as mentioned in YARN-9598. However, if what we want for this issue is just to fix this problem for heartbeat-driven scenarios, and later will have a more complete solution, the change is fine to me for now. In our internal version, we already remove this check to support allocating OPPORTUNISTIC containers in the main scheduling process. > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. 
JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO >
[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123686#comment-17123686 ] Tao Yang commented on YARN-10293: - Hi, [~prabhujoseph], [~wangda] This problem is similar to YARN-9598, which was in dispute so there's no further progress. In my opinion, YARN-9598 and this issue may just parts of reservation problems, it's better to refactor the reservation logic again to compatible with the scheduling framework which has been updated a lot by global scheduler, especially for multi-nodes lookup mechanism. At least we should rethink all referenced logic in scheduling cycle to have a more complete solution for current reservation. Thoughts? > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. 
Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. > *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO >
[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059185#comment-17059185 ] Tao Yang commented on YARN-9050: Thanks [~cheersyang] very much for your help and patience, I really appreciate it! > [Umbrella] Usability improvements for scheduler activities > -- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2018-11-23-16-46-38-138.png > > > We have made some usability improvements for scheduler activities, based on > YARN 3.1, in our cluster, as follows: > 1. Not available for multi-threaded asynchronous scheduling. App and node > activities may get confused when multiple scheduling threads record activities > of different allocation processes in the same variables, such as appsAllocation > and recordingNodesAllocation in ActivitiesManager. I think these variables > should be thread-local to keep activities separate across threads (see the sketch after this message). > 2. Incomplete activities for the multi-node lookup mechanism, since > ActivitiesLogger skips recording through \{{if (node == null || > activitiesManager == null) }} when node is null, which indicates that the > allocation is for multiple nodes. We need to support recording activities for > the multi-node lookup mechanism. > 3. Current app activities cannot meet the requirements of diagnostics. For > example, we can know that a node doesn't match a request but it is hard to know why, > especially when using placement constraints, where it is difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses of > activities: add a diagnosis for the placement-constraints check, update the > insufficient-resource diagnosis with detailed info (like 'insufficient > resource names:[memory-mb]'), and so on. > 4. Add more useful fields to app activities. In some scenarios we need to > distinguish different requests but can't locate them based on the app > activities info; some other fields, such as allocation tags, can help to filter what we want. > We have added containerPriority, allocationRequestId > and allocationTags fields to AppAllocation. > 5. Filter app activities by key fields. Sometimes the results of app > activities are massive and it's hard to find what we want. We have supported filtering > by allocation-tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster, and we frequently want to > know why a request can't be allocated in the cluster; it's hard to check every node > manually in a large cluster, so aggregating app activities by > diagnoses is necessary. We have added a groupingType > parameter to the app-activities REST API for this, which supports grouping by > diagnostics. > I think we can have a discussion about these points; useful improvements which > are accepted will be added to the patch. Thanks. > The running design doc is attached > [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
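Point 1 of the YARN-9050 description above (thread-local recording for async scheduling) can be sketched minimally as follows; the class and method names are illustrative only, and the real ActivitiesManager keeps much richer structures than a list of strings:
{code}
import java.util.ArrayList;
import java.util.List;

public class PerThreadActivitiesSketch {
  // Each async-scheduling thread records its own allocation activities, so
  // concurrent scheduling cycles no longer interleave in shared lists.
  private final ThreadLocal<List<String>> appsAllocation =
      ThreadLocal.withInitial(ArrayList::new);

  public void addAppActivity(String activity) {
    appsAllocation.get().add(activity);
  }

  // Called at the end of a scheduling cycle: publish and reset this thread's records.
  public List<String> finishAndDrain() {
    List<String> recorded = new ArrayList<>(appsAllocation.get());
    appsAllocation.get().clear();
    return recorded;
  }
}
{code}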
[jira] [Commented] (YARN-10192) CapacityScheduler stuck in loop rejecting allocation proposals
[ https://issues.apache.org/jira/browse/YARN-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057537#comment-17057537 ] Tao Yang commented on YARN-10192: - Hi, [~wangda]. I'm not sure about this issue, we have found some issues when async-scheduling is enabled, this issue seemsnot in the async-scheduling mode according to the logs above and it's hard to found the root cause from these logs, I think more logs are needed for further analyzing via dynamically updating log level of some important classes (such as org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp) to DEBUG. BTW, scheduler activities is more useful for debugging but only applicable after version-3.3. > CapacityScheduler stuck in loop rejecting allocation proposals > -- > > Key: YARN-10192 > URL: https://issues.apache.org/jira/browse/YARN-10192 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.10.0 >Reporter: Jonathan Hung >Priority: Major > > On a 2.10.0 cluster, we observed containers were being scheduled very slowly. > Based on logs, it seems to reject a bunch of allocation proposals, then > accept a bunch of reserved containers, but very few containers are actually > getting allocated: > {noformat} > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,965 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,968 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 
06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default clusterResource= vCores:34413, yarn.io/gpu: 1241> type=OFF_SWITCH requestedPartition=cpu > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=misc usedCapacity=0.0031771248 > absoluteUsedCapacity=3.1771246E-4 used= > cluster= > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.30113637 > absoluteUsedCapacity=0.30113637 used= yarn.io/gpu: 265> cluster= > 2020-03-10 06:31:48,977 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal > 2020-03-10 06:31:48,981 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1582403122262_15460_01 > container=null queue=misc_default
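The comment above suggests dynamically raising the log level of classes such as FiCaSchedulerApp to DEBUG. A small illustrative sketch of that idea, assuming the RM is backed by log4j 1.x as in Hadoop 2.10 (the class and method here are hypothetical, not a Hadoop API):
{code}
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;

public class DebugLogLevelSketch {
  // Raise the level of a single logger at runtime. This is effectively what the
  // daemon's /logLevel servlet does inside the RM JVM; running this in a separate
  // process would only change the local JVM's logger, not the RM's.
  public static void enableDebugFor(String className) {
    LogManager.getLogger(className).setLevel(Level.DEBUG);
  }

  public static void main(String[] args) {
    enableDebugFor("org.apache.hadoop.yarn.server.resourcemanager"
        + ".scheduler.common.fica.FiCaSchedulerApp");
  }
}
{code}
In practice the same effect is usually obtained over HTTP via the daemon's log-level endpoint, so no code change or restart is needed.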
[jira] [Commented] (YARN-10151) Disable Capacity Scheduler's move app between queue functionality
[ https://issues.apache.org/jira/browse/YARN-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039665#comment-17039665 ] Tao Yang commented on YARN-10151: - Hi, [~leftnoteasy] FYI, a related issue which can make that happen has been solved in YARN-9838. > Disable Capacity Scheduler's move app between queue functionality > - > > Key: YARN-10151 > URL: https://issues.apache.org/jira/browse/YARN-10151 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Priority: Critical > > Saw this happened in many clusters: Capacity Scheduler cannot work correctly > with the move app between queue features. It will cause weird JMX issue, > resource accounting issue, etc. In a lot of causes it will cause RM > completely hung and available resource became negative, nothing can be > allocated after that. We should turn off CapacityScheduler's move app between > queue feature. (see: > {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#moveApplication}} > ) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029648#comment-17029648 ] Tao Yang commented on YARN-9567: Thanks [~cheersyang] for the review. It seems that wrong file was taken as the new patch, some information in console output: YARN-9567 patch is being downloaded at Mon Feb 3 20:38:28 UTC 2020 from https://issues.apache.org/jira/secure/attachment/12991343/scheduler-activities-example.png -> Downloaded Attached v4 patch (same as v3 patch) to re-trigger the jenkins job. > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, YARN-9567.004.patch, app-activities-example.png, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, scheduler-activities-example.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: YARN-9567.004.patch > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, YARN-9567.004.patch, app-activities-example.png, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, scheduler-activities-example.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019278#comment-17019278 ] Tao Yang commented on YARN-9538: Thanks [~cheersyang] for the review. Attached v4 patch to fix failures in Jenkins. > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch, > YARN-9538.003.patch, YARN-9538.004.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9538: --- Attachment: YARN-9538.004.patch > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch, > YARN-9538.003.patch, YARN-9538.004.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019208#comment-17019208 ] Tao Yang commented on YARN-9567: Thanks [~cheersyang] for the review. I have attached V3 patch with updates: * Enable showing activities info only when CS is enabled. * Support pagination for the activities table, examples: Showing app diagnostics: !app-activities-example.png! Showing scheduler activities (when app diagnostics are not found): !scheduler-activities-example.png! > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, app-activities-example.png, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, scheduler-activities-example.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
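As a rough illustration of the first point (only rendering the activities table when CapacityScheduler is active), the guard could look something like the sketch below; the class and helper method are hypothetical, not part of the actual patch:

{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

public final class AppActivitiesRenderGuard {
  // Activities are a CS-only feature, so the diagnostics table is rendered
  // only when the active scheduler is CapacityScheduler; FairScheduler pages
  // simply skip it.
  static boolean shouldRenderActivities(ResourceManager rm) {
    ResourceScheduler scheduler = rm.getResourceScheduler();
    return scheduler instanceof CapacityScheduler;
  }
}
{code}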
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: scheduler-activities-example.png > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, app-activities-example.png, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, scheduler-activities-example.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: app-activities-example.png > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, app-activities-example.png, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: YARN-9567.003.patch > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, image-2019-06-04-17-29-29-368.png, > image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, > image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: (was: YARN-9567.003.patch) > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9567: --- Attachment: YARN-9567.003.patch > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > YARN-9567.003.patch, image-2019-06-04-17-29-29-368.png, > image-2019-06-04-17-31-31-820.png, image-2019-06-04-17-58-11-886.png, > image-2019-06-14-11-21-41-066.png, no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012615#comment-17012615 ] Tao Yang commented on YARN-7007: Already cherry-picked this fix to branch-2.8 > NPE in RM while using YarnClient.getApplications() > -- > > Key: YARN-7007 > URL: https://issues.apache.org/jira/browse/YARN-7007 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Lingfeng Su >Assignee: Lingfeng Su >Priority: Major > Labels: patch > Fix For: 2.9.0, 3.0.0-beta1, 2.8.6 > > Attachments: YARN-7007.001.patch > > > {code:java} > java.lang.NullPointerException: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254) > at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy161.getApplications(Unknown Source) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456) > {code} > When I use YarnClient.getApplications() to get all applications from the RM, > it occasionally throws an NPE. > {code:java} > RMAppAttempt currentAttempt = rmContext.getRMApps() >.get(attemptId.getApplicationId()).getCurrentAppAttempt(); > {code} > If the application id is not in the ConcurrentMap returned by > getRMApps(), this chain may throw an NPE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
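For reference, a hedged sketch of the null-safe lookup the fix is about; the names follow the snippet quoted in the description, while the surrounding handling is illustrative and the committed patch may structure it differently:

{code:java}
// Instead of chaining the calls, check whether the app is still known to the
// RMContext; a completed/purged app would otherwise produce the NPE above.
RMApp rmApp = rmContext.getRMApps().get(attemptId.getApplicationId());
RMAppAttempt currentAttempt = (rmApp == null) ? null : rmApp.getCurrentAppAttempt();
if (currentAttempt == null) {
  // Skip this attempt (or return an empty usage report) rather than throwing.
  return null;
}
{code}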
[jira] [Updated] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7007: --- Fix Version/s: 2.8.6 > NPE in RM while using YarnClient.getApplications() > -- > > Key: YARN-7007 > URL: https://issues.apache.org/jira/browse/YARN-7007 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Lingfeng Su >Assignee: Lingfeng Su >Priority: Major > Labels: patch > Fix For: 2.9.0, 3.0.0-beta1, 2.8.6 > > Attachments: YARN-7007.001.patch > > > {code:java} > java.lang.NullPointerException: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254) > at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy161.getApplications(Unknown Source) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456) > {code} > When I use YarnClient.getApplications() to get all applications from the RM, > it occasionally throws an NPE. > {code:java} > RMAppAttempt currentAttempt = rmContext.getRMApps() >.get(attemptId.getApplicationId()).getCurrentAppAttempt(); > {code} > If the application id is not in the ConcurrentMap returned by > getRMApps(), this chain may throw an NPE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012554#comment-17012554 ] Tao Yang commented on YARN-7007: [~fly_in_gis], thanks for the feedback, I will cherry-pick this fix to 2.8 later. > NPE in RM while using YarnClient.getApplications() > -- > > Key: YARN-7007 > URL: https://issues.apache.org/jira/browse/YARN-7007 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Lingfeng Su >Assignee: Lingfeng Su >Priority: Major > Labels: patch > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-7007.001.patch > > > {code:java} > java.lang.NullPointerException: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254) > at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy161.getApplications(Unknown Source) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456) > {code} > When I use YarnClient.getApplications() to get all applications from the RM, > it occasionally throws an NPE. > {code:java} > RMAppAttempt currentAttempt = rmContext.getRMApps() >.get(attemptId.getApplicationId()).getCurrentAppAttempt(); > {code} > If the application id is not in the ConcurrentMap returned by > getRMApps(), this chain may throw an NPE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011789#comment-17011789 ] Tao Yang commented on YARN-9567: Thanks [~cheersyang] for the review. {quote} 1. since this is a CS only feature, pls make sure nothing breaks when FS is enabled {quote} Yes, it should show this table only when CS is enabled; this will be updated in the next patch. {quote} 2. does the table support paging? {quote} Not yet. I think it's not a strong requirement since this table is only used for debugging; we will rarely get a long table here, and even if we do, it should have only a minor impact on the UI, right? > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9567.001.patch, YARN-9567.002.patch, > image-2019-06-04-17-29-29-368.png, image-2019-06-04-17-31-31-820.png, > image-2019-06-04-17-58-11-886.png, image-2019-06-14-11-21-41-066.png, > no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9538: --- Attachment: YARN-9538.003.patch > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch, > YARN-9538.003.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011781#comment-17011781 ] Tao Yang commented on YARN-9538: Attached v3 patch in which most comments are addressed, updates need more discussion are as follows: CS 1. the table of content can be auto-generated by Doxia Macros via defining "MACRO\{toc|fromDepth=0|toDepth=3}", so there's nothing we can do for this. I have updated other modifications, please help to review them as well, thanks: // Activities Scheduling activities are activity messages used for debugging on some critical scheduling path, they can be recorded and exposed via RESTful API with minor impact on the scheduler performance. // Scheduler Activities Scheduler activities include useful scheduling info in a scheduling cycle, which illustrate how the scheduler allocates a container. Scheduler activities REST API (`http://rm-http-address:port/ws/v1/cluster/scheduler/activities`) provides a way to enable recording scheduler activities and fetch them from cache.To eliminate the performance impact, scheduler automatically disables recording activities at the end of a scheduling cycle, you can query the RESTful API again to get the latest scheduler activities. // Application Activities Application activities include useful scheduling info for a specified application, which illustrate how the requirements are satisfied or just skipped. Application activities REST API (`http://rm-http-address:port/ws/v1/cluster/scheduler/app-activities/\{appid}`) provides a way to enable recording application activities for a specified application within a few seconds or fetch historical application activities from cache, available actions which include "refresh" and "get" can be specified by the "actions" parameter: RM 1. +The scheduler activities API currently supports Capacity Scheduler and provides a way to get scheduler activities in a single scheduling process, it will trigger recording scheduler activities in next scheduling process and then take last required scheduler activities from cache as the response. The response have hierarchical structure with multiple levels and important scheduling details which are organized by the sequence of scheduling process: -> The scheduler activities Restful API {color:#FF}is available if you are using capacity scheduler and{color} can fetch scheduler activities info recorded in a scheduling cycle. The API returns a message that includes important scheduling activities info {color:#FF}which has a hierarchical layout with following fields:{color} 7. + Application activities include useful scheduling info for a specified application, the response have hierarchical structure with multiple levels: -> Application activities Restful API {color:#FF}is available if you are using capacity scheduler and can fetch useful scheduling info for a specified application{color}, the response has a hierarchical layout with following fields: 8. * *AppActivities* - AppActivities are root structure of application activities within basic information. -> is the root element? Yes, updated: AppActivities are root {color:#FF}element{color} ... 9. +* *Applications* - Allocations are allocation attempts at app level queried from the cache. -> shouldn't here be applications? Right, updated: +* {color:#FF}*Allocations*{color} - Allocations ... 
> Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch, > YARN-9538.003.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
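For readers following the documented endpoints, a small usage sketch of the two REST APIs described above (Java 11+ HTTP client; the host, port and appid below are placeholders, and the paths come straight from the comment):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ActivitiesRestExample {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    // Scheduler activities: the first call enables recording for the next
    // scheduling cycle; querying again returns the cached result.
    String schedulerActivities =
        "http://rm-http-address:8088/ws/v1/cluster/scheduler/activities";
    // App activities for a specific application (appid is a placeholder).
    String appActivities =
        "http://rm-http-address:8088/ws/v1/cluster/scheduler/app-activities/application_1_0001";
    for (String url : new String[] {schedulerActivities, appActivities}) {
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(url))
          .header("Accept", "application/json")
          .GET()
          .build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(url + " -> " + response.statusCode());
      System.out.println(response.body());
    }
  }
}
{code}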
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011505#comment-17011505 ] Tao Yang commented on YARN-9538: Thanks [~cheersyang] for finding out mistakes and providing better descriptions, I'll fix them as soon as possible. > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011387#comment-17011387 ] Tao Yang commented on YARN-9538: Attached v2 patch, which has been checked via hugo in my local test environment. > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9538: --- Attachment: YARN-9538.002.patch > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch, YARN-9538.002.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010339#comment-17010339 ] Tao Yang commented on YARN-9050: Glad to hear that the 3.3.0 release is on the way, and thanks for reminding me. The remaining issues are almost ready and only need some reviews; they can be done before this release, thanks. > [Umbrella] Usability improvements for scheduler activities > -- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: image-2018-11-23-16-46-38-138.png > > > We have made some usability improvements for scheduler activities based on > YARN 3.1 in our cluster, as follows: > 1. Not available for multi-threaded asynchronous scheduling. App and node > activities may be confused when multiple scheduling threads record activities > of different allocation processes in the same variables, like appsAllocation > and recordingNodesAllocation in ActivitiesManager. I think these variables > should be thread-local to keep activities clear among multiple threads. > 2. Incomplete activities for the multi-node lookup mechanism, since > ActivitiesLogger will skip recording through \{{if (node == null || > activitiesManager == null) }} when node is null, which indicates this > allocation is for multiple nodes. We need to support recording activities for > the multi-node lookup mechanism. > 3. Current app activities cannot meet the requirements of diagnostics; for > example, we can know that a node doesn't match a request but it is hard to know > why, and especially when using placement constraints, it's difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses of > activities: add a diagnosis for the placement constraints check, update the > insufficient-resource diagnosis with detailed info (like 'insufficient > resource names:[memory-mb]') and so on. > 4. Add more useful fields for app activities. In some scenarios we need to > distinguish different requests but can't locate them based on the app > activities info; some other fields, such as allocation tags, can help to filter > what we want. We have added containerPriority, allocationRequestId > and allocationTags fields in AppAllocation. > 5. Filter app activities by key fields. Sometimes the results of app > activities are massive and it's hard to find what we want. We have supported > filtering by allocation-tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster; we frequently want to > know why a request can't be allocated in the cluster, and it's hard to check > every node manually, so aggregating app activities by diagnoses is necessary to > solve this trouble. We have added a groupingType parameter to the > app-activities REST API for this, which supports grouping by diagnostics. > I think we can have a discussion about these points; useful improvements which > are accepted will be added into the patch. Thanks. > The running design doc is attached > [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5]. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
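A minimal sketch of the thread-local idea from point 1 of the description above: each async scheduling thread records its own in-flight activities instead of sharing one collection. Class and field names here are illustrative, not the actual ActivitiesManager members:

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class PerThreadActivityBuffer {
  // One activity buffer per async scheduling thread, so concurrent threads
  // no longer interleave their records in a shared collection.
  private static final ThreadLocal<List<String>> APPS_ALLOCATION =
      ThreadLocal.withInitial(ArrayList::new);

  public static void record(String activity) {
    APPS_ALLOCATION.get().add(activity);
  }

  // Called at the end of a scheduling cycle to publish and reset the buffer.
  public static List<String> drain() {
    List<String> recorded = new ArrayList<>(APPS_ALLOCATION.get());
    APPS_ALLOCATION.get().clear();
    return recorded;
  }
}
{code}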
[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store
[ https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-10059: Attachment: YARN-10059.001.patch > Final states of failed-to-localize containers are not recorded in NM state > store > > > Key: YARN-10059 > URL: https://issues.apache.org/jira/browse/YARN-10059 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-10059.001.patch > > > Currently we found an issue that many localizers of completed containers were > launched and exhausted memory/cpu of that machine after NM restarted, these > containers were all failed and completed when localizing on a non-existed > local directory which is caused by another problem, but their final states > weren't recorded in NM state store. > The process flow of a fail-to-localize container is as follow: > {noformat} > ResourceLocalizationService$LocalizerRunner#run > -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> > LOCALIZATION_FAILED upon RESOURCE_FAILED > dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES > -> ResourceLocalizationService#handleCleanupContainerResources handle > CLEANUP_CONTAINER_RESOURCES > dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP > -> ContainerImpl$LocalizationFailedToDoneTransition#transition > handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP > {noformat} > There's no update for state store in this flow now, which is required to > avoid unnecessary localizations after NM restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store
[ https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-10059: Attachment: (was: YARN-10059.001.patch) > Final states of failed-to-localize containers are not recorded in NM state > store > > > Key: YARN-10059 > URL: https://issues.apache.org/jira/browse/YARN-10059 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > > Currently we found an issue that many localizers of completed containers were > launched and exhausted memory/cpu of that machine after NM restarted, these > containers were all failed and completed when localizing on a non-existed > local directory which is caused by another problem, but their final states > weren't recorded in NM state store. > The process flow of a fail-to-localize container is as follow: > {noformat} > ResourceLocalizationService$LocalizerRunner#run > -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> > LOCALIZATION_FAILED upon RESOURCE_FAILED > dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES > -> ResourceLocalizationService#handleCleanupContainerResources handle > CLEANUP_CONTAINER_RESOURCES > dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP > -> ContainerImpl$LocalizationFailedToDoneTransition#transition > handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP > {noformat} > There's no update for state store in this flow now, which is required to > avoid unnecessary localizations after NM restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store
[ https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002700#comment-17002700 ] Tao Yang commented on YARN-10059: - Attached v1 patch for review. > Final states of failed-to-localize containers are not recorded in NM state > store > > > Key: YARN-10059 > URL: https://issues.apache.org/jira/browse/YARN-10059 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-10059.001.patch > > > Currently we found an issue that many localizers of completed containers were > launched and exhausted memory/cpu of that machine after NM restarted, these > containers were all failed and completed when localizing on a non-existed > local directory which is caused by another problem, but their final states > weren't recorded in NM state store. > The process flow of a fail-to-localize container is as follow: > {noformat} > ResourceLocalizationService$LocalizerRunner#run > -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> > LOCALIZATION_FAILED upon RESOURCE_FAILED > dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES > -> ResourceLocalizationService#handleCleanupContainerResources handle > CLEANUP_CONTAINER_RESOURCES > dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP > -> ContainerImpl$LocalizationFailedToDoneTransition#transition > handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP > {noformat} > There's no update for state store in this flow now, which is required to > avoid unnecessary localizations after NM restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store
[ https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-10059: Attachment: YARN-10059.001.patch > Final states of failed-to-localize containers are not recorded in NM state > store > > > Key: YARN-10059 > URL: https://issues.apache.org/jira/browse/YARN-10059 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-10059.001.patch > > > Currently we found an issue that many localizers of completed containers were > launched and exhausted memory/cpu of that machine after NM restarted, these > containers were all failed and completed when localizing on a non-existed > local directory which is caused by another problem, but their final states > weren't recorded in NM state store. > The process flow of a fail-to-localize container is as follow: > {noformat} > ResourceLocalizationService$LocalizerRunner#run > -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> > LOCALIZATION_FAILED upon RESOURCE_FAILED > dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES > -> ResourceLocalizationService#handleCleanupContainerResources handle > CLEANUP_CONTAINER_RESOURCES > dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP > -> ContainerImpl$LocalizationFailedToDoneTransition#transition > handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP > {noformat} > There's no update for state store in this flow now, which is required to > avoid unnecessary localizations after NM restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store
Tao Yang created YARN-10059: --- Summary: Final states of failed-to-localize containers are not recorded in NM state store Key: YARN-10059 URL: https://issues.apache.org/jira/browse/YARN-10059 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Tao Yang Assignee: Tao Yang Currently we found an issue that many localizers of completed containers were launched and exhausted memory/cpu of that machine after NM restarted, these containers were all failed and completed when localizing on a non-existed local directory which is caused by another problem, but their final states weren't recorded in NM state store. The process flow of a fail-to-localize container is as follow: {noformat} ResourceLocalizationService$LocalizerRunner#run -> ContainerImpl$ResourceFailedTransition#transition handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES -> ResourceLocalizationService#handleCleanupContainerResources handle CLEANUP_CONTAINER_RESOURCES dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP -> ContainerImpl$LocalizationFailedToDoneTransition#transition handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP {noformat} There's no update for state store in this flow now, which is required to avoid unnecessary localizations after NM restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
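A hedged sketch of the missing step described above, assuming the same state-store hook the normal completion path uses (storeContainerCompleted, as I understand the NM state store API); the real change belongs inside ContainerImpl's LocalizationFailedToDoneTransition and may be wired differently:

{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class LocalizationFailureRecovery {
  private static final Logger LOG =
      LoggerFactory.getLogger(LocalizationFailureRecovery.class);

  // Persist the final state when a container goes LOCALIZATION_FAILED -> DONE,
  // so a restarted NM will not launch localizers for it again.
  static void recordLocalizationFailedDone(NMStateStoreService stateStore,
      ContainerId containerId, int exitCode) {
    try {
      stateStore.storeContainerCompleted(containerId, exitCode);
    } catch (IOException e) {
      LOG.error("Unable to store final state for container " + containerId, e);
    }
  }
}
{code}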
[jira] [Updated] (YARN-9838) Fix resource inconsistency for queues when moving app with reserved container to another queue
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9838: --- Fix Version/s: 3.1.4 3.2.2 2.9.3 3.3.0 > Fix resource inconsistency for queues when moving app with reserved container > to another queue > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Assignee: jiulongzhu >Priority: Critical > Labels: patch > Fix For: 3.3.0, 2.9.3, 3.2.2, 3.1.4 > > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch, YARN-9838.0002.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
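As a rough sketch of the fix idea: during a move, reserved containers must be detached from the source queue and attached to the target queue just like live containers, otherwise "Used Resource" and numContainer drift exactly as the table in the description shows. The variable names below are illustrative and the committed patch may differ:

{code:java}
// Inside the CapacityScheduler move path, alongside the existing handling of
// live containers; skipping reserved containers is what causes the leak.
for (RMContainer reservedContainer : app.getReservedContainers()) {
  sourceQueue.detachContainer(getClusterResource(), app, reservedContainer);
  targetQueue.attachContainer(getClusterResource(), app, reservedContainer);
}
{code}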
[jira] [Updated] (YARN-9838) Fix resource inconsistency for queues when moving app with reserved container to another queue
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9838: --- Summary: Fix resource inconsistency for queues when moving app with reserved container to another queue (was: Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics error ) > Fix resource inconsistency for queues when moving app with reserved container > to another queue > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Assignee: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch, YARN-9838.0002.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9635) Nodes page displayed duplicate nodes
[ https://issues.apache.org/jira/browse/YARN-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974098#comment-16974098 ] Tao Yang commented on YARN-9635: Hi, [~jiwq]. I think the description of the conf in NodeManager.md is not sufficient yet; we should add some details about this change, such as which version it was introduced in and why. [~sunilg], any thoughts about the new patch? > Nodes page displayed duplicate nodes > > > Key: YARN-9635 > URL: https://issues.apache.org/jira/browse/YARN-9635 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 3.2.0 >Reporter: Wanqiang Ji >Assignee: Wanqiang Ji >Priority: Major > Attachments: UI2-nodes.jpg, YARN-9635.001.patch, YARN-9635.002.patch > > > Steps: > * shutdown nodes > * start nodes > Nodes Page: > !UI2-nodes.jpg! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9958) Remove the invalid lock in ContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974079#comment-16974079 ] Tao Yang commented on YARN-9958: Thanks [~jiwq] for this improvement. Patch LGTM: the related r/w lock only guards ContainerExecutor#pidFiles, which is a ConcurrentHashMap and does not need to be protected by an additional lock. I will commit this in a few days if there are no further comments. > Remove the invalid lock in ContainerExecutor > > > Key: YARN-9958 > URL: https://issues.apache.org/jira/browse/YARN-9958 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wanqiang Ji >Assignee: Wanqiang Ji >Priority: Major > > ContainerExecutor has ReadLock and WriteLock. These are used to call the get/put > methods of ConcurrentMap. Since ConcurrentMap provides thread safety and > atomicity guarantees, we can remove the lock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
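[Editor's note] For illustration, a minimal sketch of the point above: a ConcurrentHashMap already makes single-key get/put thread-safe and atomic, so wrapping those calls in a read/write lock adds no protection. The class and field names below are simplified stand-ins, not the actual ContainerExecutor code.
{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified stand-in for the pattern discussed above: the map itself already
// provides thread-safe, atomic get/put for a single key, so no extra
// read/write lock is needed around these operations.
public class PidFileRegistry {

  private final ConcurrentMap<String, Path> pidFiles = new ConcurrentHashMap<>();

  // Thread-safe without any external locking.
  public Path getPidFilePath(String containerId) {
    return pidFiles.get(containerId);
  }

  // Also thread-safe; put is atomic for a single key.
  public void registerPidFile(String containerId, Path pidFilePath) {
    pidFiles.put(containerId, pidFilePath);
  }

  public static void main(String[] args) {
    PidFileRegistry registry = new PidFileRegistry();
    registry.registerPidFile("container_01", Paths.get("/tmp/container_01.pid"));
    System.out.println(registry.getPidFilePath("container_01"));
  }
}
{code}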
[jira] [Comment Edited] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845 ] Tao Yang edited comment on YARN-7621 at 10/23/19 12:51 PM: --- Hi, [~cane]. Sorry for the late reply. It makes perfect sense for me to support duplicate queue names, as [~wilfreds] mentioned, there's more work to do for that. I'm afraid of having no time to work on this recently, please feel free to take over this issue if you want, Thanks. was (Author: tao yang): Hi, [~cane]. Sorry for the late reply. It's make perfect sense for me to support duplicate queue names, as [~wilfreds] mentioned, there's more work to do for that. I'm afraid of having no time to work on this recently, please feel free to take over this issue if you want, Thanks. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference of queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs queue path but CapacityScheduler needs queue name. There > is no doubt of the correction of queue definition for CapacityScheduler > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with queue path for CapacityScheduler to make the interface clearer and > scheduler switch smoothly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845 ] Tao Yang commented on YARN-7621: Hi, [~cane]. Sorry for the late reply. It's make perfect sense for me to support duplicate queue names, as [~wilfreds] mentioned, there's more work to do for that. I'm afraid of having no time to work on this recently, please feel free to take over this issue, Thanks. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference of queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs queue path but CapacityScheduler needs queue name. There > is no doubt of the correction of queue definition for CapacityScheduler > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with queue path for CapacityScheduler to make the interface clearer and > scheduler switch smoothly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957845#comment-16957845 ] Tao Yang edited comment on YARN-7621 at 10/23/19 12:48 PM: --- Hi, [~cane]. Sorry for the late reply. It's make perfect sense for me to support duplicate queue names, as [~wilfreds] mentioned, there's more work to do for that. I'm afraid of having no time to work on this recently, please feel free to take over this issue if you want, Thanks. was (Author: tao yang): Hi, [~cane]. Sorry for the late reply. It's make perfect sense for me to support duplicate queue names, as [~wilfreds] mentioned, there's more work to do for that. I'm afraid of having no time to work on this recently, please feel free to take over this issue, Thanks. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: fs2cs > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference of queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler. > FairScheduler needs queue path but CapacityScheduler needs queue name. There > is no doubt of the correction of queue definition for CapacityScheduler > because it does not allow duplicate leaf queue names, but it's hard to switch > between FairScheduler and CapacityScheduler. I propose to support submitting > apps with queue path for CapacityScheduler to make the interface clearer and > scheduler switch smoothly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
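[Editor's note] To make the queue-addressing difference concrete, a toy sketch (hypothetical queue names, not CapacityScheduler code): full paths stay unambiguous even when two leaf queues share a name, which is why path-based submission matters once duplicate leaf names are allowed.
{code:java}
import java.util.Arrays;
import java.util.List;

// Toy illustration of addressing a queue by its full path versus by its leaf
// name only. The queue hierarchy here is hypothetical.
public class QueuePathDemo {

  static String leafName(String queuePath) {
    int lastDot = queuePath.lastIndexOf('.');
    return lastDot < 0 ? queuePath : queuePath.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    List<String> queuePaths = Arrays.asList("root.engineering.spark", "root.research.spark");

    // Full paths are unambiguous...
    queuePaths.forEach(p -> System.out.println(p + " -> leaf name: " + leafName(p)));

    // ...while leaf-name-only addressing cannot distinguish the two "spark"
    // queues, which is why duplicate leaf names need the path-based form.
    long distinctLeafNames = queuePaths.stream().map(QueuePathDemo::leafName).distinct().count();
    System.out.println("distinct leaf names: " + distinctLeafNames + " for " + queuePaths.size() + " queues");
  }
}
{code}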
[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile
[ https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952049#comment-16952049 ] Tao Yang commented on YARN-8737: Thanks [~cheersyang] for the review. Submitted already. > Race condition in ParentQueue when reinitializing and sorting child queues in > the meanwhile > --- > > Key: YARN-8737 > URL: https://issues.apache.org/jira/browse/YARN-8737 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8737.001.patch > > > Administrator raised a update for queues through REST API, in RM parent queue > is refreshing child queues through calling ParentQueue#reinitialize, > meanwhile, async-schedule threads is sorting child queues when calling > ParentQueue#sortAndGetChildrenAllocationIterator. Race condition may happen > and throw exception as follow because TimSort does not handle the concurrent > modification of objects it is sorting: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeCollapse(TimSort.java:441) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1454) > at java.util.Collections.sort(Collections.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962) > {noformat} > I think we can add read-lock for > ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, the > write-lock will be hold when updating child queues in > ParentQueue#reinitialize. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
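[Editor's note] As a rough illustration of the locking pattern proposed in the patch (simplified stand-in classes, not the actual ParentQueue code): the sorting path takes the read lock and sorts a copy of the child list, while reinitialization takes the write lock before mutating the children, so TimSort never observes values changing underneath it during a configuration refresh.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the pattern discussed in this issue: sorting child
// queues under the read lock so that a concurrent reinitialize (which holds
// the write lock) cannot change the values the comparator depends on mid-sort.
public class ParentQueueSketch {

  static class ChildQueue {
    final String name;
    volatile float usedCapacity;
    ChildQueue(String name, float usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<ChildQueue> childQueues = new ArrayList<>();

  // Called by async scheduling threads.
  Iterator<ChildQueue> sortAndGetChildrenAllocationIterator() {
    lock.readLock().lock();
    try {
      // Copy and sort under the read lock; TimSort never sees concurrent updates
      // from a configuration refresh.
      List<ChildQueue> sorted = new ArrayList<>(childQueues);
      sorted.sort(Comparator.comparingDouble(q -> q.usedCapacity));
      return sorted.iterator();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Called when an administrator refreshes the queue configuration.
  void reinitialize(List<ChildQueue> newChildren) {
    lock.writeLock().lock();
    try {
      childQueues.clear();
      childQueues.addAll(newChildren);
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    ParentQueueSketch parent = new ParentQueueSketch();
    parent.reinitialize(Arrays.asList(new ChildQueue("a", 0.7f), new ChildQueue("b", 0.2f)));
    parent.sortAndGetChildrenAllocationIterator()
        .forEachRemaining(q -> System.out.println(q.name + " used=" + q.usedCapacity));
  }
}
{code}
Note this only covers the configuration-update scenario discussed here; capacities changed by scheduling itself (the YARN-10178 scenario) are outside what this lock protects.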
[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile
[ https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951552#comment-16951552 ] Tao Yang commented on YARN-8737: Thanks [~Amithsha] for the feedback. Sorry to have forgot this issue for a long time. [~cheersyang] & [~sunilg], Could you please help to review the patch? > Race condition in ParentQueue when reinitializing and sorting child queues in > the meanwhile > --- > > Key: YARN-8737 > URL: https://issues.apache.org/jira/browse/YARN-8737 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8737.001.patch > > > Administrator raised a update for queues through REST API, in RM parent queue > is refreshing child queues through calling ParentQueue#reinitialize, > meanwhile, async-schedule threads is sorting child queues when calling > ParentQueue#sortAndGetChildrenAllocationIterator. Race condition may happen > and throw exception as follow because TimSort does not handle the concurrent > modification of objects it is sorting: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeCollapse(TimSort.java:441) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1454) > at java.util.Collections.sort(Collections.java:175) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962) > {noformat} > I think we can add read-lock for > ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, the > write-lock will be hold when updating child queues in > ParentQueue#reinitialize. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671 ] Tao Yang edited comment on YARN-9838 at 10/14/19 3:17 AM: -- Thanks [~jiulongZhu] for updating the patch. LGTM, +1 for the patch. Last small suggestion is to add a blank line before the new test case, which I can directly update before committing. I will commit this if no further comments from others after a few days. was (Author: tao yang): Thanks [~jiulongZhu] for updating the patch. LGTM, +1 for the patch. Last small suggestion is to add a blank line before the new test case. I will commit this if no further comments from others after a few days. > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch, YARN-9838.0002.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950671#comment-16950671 ] Tao Yang commented on YARN-9838: Thanks [~jiulongZhu] for updating the patch. LGTM, +1 for the patch. Last small suggestion is to add a blank line before the new test case. I will commit this if no further comments from others after a few days. > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch, YARN-9838.0002.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
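[Editor's note] A toy model of the bookkeeping this fix is about (hypothetical names and a single memory dimension, not the actual AbstractCSQueue/LeafQueue code): moving an application's reserved container has to decrement the source queue's used resource and container count and increment the target queue's, otherwise the source keeps phantom usage and the target goes negative on release, which is the "container leak" symptom described above.
{code:java}
// Toy model of queue usage bookkeeping for a moved reserved container.
public class QueueUsageSketch {

  static class QueueUsage {
    final String name;
    long usedMemoryMb;
    int numContainers;
    QueueUsage(String name) { this.name = name; }

    void allocate(long memoryMb) { usedMemoryMb += memoryMb; numContainers++; }
    void release(long memoryMb)  { usedMemoryMb -= memoryMb; numContainers--; }

    @Override public String toString() {
      return name + "{usedMb=" + usedMemoryMb + ", containers=" + numContainers + "}";
    }
  }

  // The move has to be symmetric for reserved containers as well.
  static void moveReservedContainer(QueueUsage from, QueueUsage to, long memoryMb) {
    from.release(memoryMb);
    to.allocate(memoryMb);
  }

  public static void main(String[] args) {
    QueueUsage fromQueue = new QueueUsage("root.from");
    QueueUsage toQueue = new QueueUsage("root.to");

    fromQueue.allocate(4096);                        // container reserved in $FROM
    moveReservedContainer(fromQueue, toQueue, 4096); // app moved to $TO
    toQueue.release(4096);                           // container released against $TO

    // Both queues end at zero; without the move step, $FROM stays at +4096 and
    // $TO ends at -4096, which is the inconsistency described above.
    System.out.println(fromQueue + " " + toQueue);
  }
}
{code}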
[jira] [Comment Edited] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330 ] Tao Yang edited comment on YARN-9838 at 10/11/19 10:02 AM: --- Thanks [~jiulongZhu] for fixing this issue. The patch LGTM in general, some minor suggestions for the patch: * check-style warnings need to be fixed, after that, you can run "dev-support/bin/test-patch /path/to/my.patch" to confirm. * The indentation of updated log need to be adjusted and useless deletion of a blank line should be reverted in LeafQueue. * The annotation "sync ResourceUsageByLabel ResourceUsageByUser and numContainer" can be removed since it seems unnecessary to add details here. * As for UT, you can remove before-fixed block and just keep the correct verification. Moreover, I think it's better to remove the method annotation("//YARN-9838") since we can find the source easily by git, and the annotation style "/\*\* \*/" often used for class or method, it's better to use "//" or "/\* \*/" in the method. was (Author: tao yang): Thanks [~jiulongZhu] for fixing this issue. The patch is LGTM in general, some minor suggestions for the patch: * check-style warnings need to be fixed, after that, you can run "dev-support/bin/test-patch /path/to/my.patch" to confirm. * The indentation of updated log need to be adjusted and useless deletion of a blank line should be reverted in LeafQueue. * The annotation "sync ResourceUsageByLabel ResourceUsageByUser and numContainer" can be removed since it seems unnecessary to add details here. * As for UT, you can remove before-fixed block and just keep the correct verification. Moreover, I think it's better to remove "//YARN-9838" since we can find the source easily by git, and the annotation style "/** */" often used for class or method, it's better to use "//" or "/* */" in the method. > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. 
> The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9838: --- Issue Type: Bug (was: Improvement) > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metrics
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9838: --- Fix Version/s: (was: 2.7.3) > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9838) Using the CapacityScheduler,Apply "movetoqueue" on the application which CS reserved containers for,will cause "Num Container" and "Used Resource" in ResourceUsage metri
[ https://issues.apache.org/jira/browse/YARN-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949330#comment-16949330 ] Tao Yang commented on YARN-9838: Thanks [~jiulongZhu] for fixing this issue. The patch is LGTM in general, some minor suggestions for the patch: * check-style warnings need to be fixed, after that, you can run "dev-support/bin/test-patch /path/to/my.patch" to confirm. * The indentation of updated log need to be adjusted and useless deletion of a blank line should be reverted in LeafQueue. * The annotation "sync ResourceUsageByLabel ResourceUsageByUser and numContainer" can be removed since it seems unnecessary to add details here. * As for UT, you can remove before-fixed block and just keep the correct verification. Moreover, I think it's better to remove "//YARN-9838" since we can find the source easily by git, and the annotation style "/** */" often used for class or method, it's better to use "//" or "/* */" in the method. > Using the CapacityScheduler,Apply "movetoqueue" on the application which CS > reserved containers for,will cause "Num Container" and "Used Resource" in > ResourceUsage metrics error > -- > > Key: YARN-9838 > URL: https://issues.apache.org/jira/browse/YARN-9838 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Affects Versions: 2.7.3 >Reporter: jiulongzhu >Priority: Critical > Labels: patch > Fix For: 2.7.3 > > Attachments: RM_UI_metric_negative.png, RM_UI_metric_positive.png, > YARN-9838.0001.patch > > > In some clusters of ours, we are seeing "Used Resource","Used > Capacity","Absolute Used Capacity" and "Num Container" is positive or > negative when the queue is absolutely idle(no RUNNING, no NEW apps...).In > extreme cases, apps couldn't be submitted to the queue that is actually idle > but the "Used Resource" is far more than zero, just like "Container Leak". > Firstly,I found that "Used Resource","Used Capacity" and "Absolute Used > Capacity" use the "Used" value of ResourceUsage kept by AbstractCSQueue, and > "Num Container" use the "numContainer" value kept by LeafQueue.And > AbstractCSQueue#allocateResource and AbstractCSQueue#releaseResource will > change the state value of "numContainer" and "Used". Secondly, by comparing > the values numContainer and ResourceUsageByLabel and QueueMetrics > changed(#allocateContainer and #releaseContainer) logic of applications with > and without "movetoqueue",i found that moving the reservedContainers didn't > modify the "numContainer" value in AbstractCSQueue and "used" value in > ResourceUsage when the application was moved from a queue to another queue. > The metric values changed logic of reservedContainers are allocated, > and moved from $FROM queue to $TO queue, and released.The degree of increase > and decrease is not conservative, the Resource allocated from $FROM queue and > release to $TO queue. > ||move reversedContainer||allocate||movetoqueue||release|| > |numContainer|increase in $FROM queue|{color:#FF}$FROM queue stay the > same,$TO queue stay the same{color}|decrease in $TO queue| > |ResourceUsageByLabel(USED)|increase in $FROM queue|{color:#FF}$FROM > queue stay the same,$TO queue stay the same{color}|decrease in $TO queue | > |QueueMetrics|increase in $FROM queue|decrease in $FROM queue, increase in > $TO queue|decrease in $TO queue| > The metric values changed logic of allocatedContainer(allocated, > acquired, running) are allocated, and movetoqueue, and released are > absolutely conservative. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672 ] Tao Yang edited comment on YARN-8995 at 9/7/19 12:33 AM: - Thanks [~jhung] for fixing this problem, sorry for missing the logger class changes in branch-3.1 and branch-3.2. Failures in the jenkins report are caused by the running environment and are unrelated to the patch. Patch LGTM and already tested in my local environment. Committing shortly. was (Author: tao yang): Thanks [~jhung] for fixing this problem, sorry for missing changes about logger class in branch-3.1. Patch LGTM and already tested in my local environment. Committing shortly. > Log events info in AsyncDispatcher when event queue size cumulatively reaches > a certain number every time. > -- > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: TestStreamPerf.java, > YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, > image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924672#comment-16924672 ] Tao Yang commented on YARN-8995: Thanks [~jhung] for fixing this problem, sorry for missing changes about logger class in branch-3.1. Patch LGTM and already tested in my local environment. Committing shortly. > Log events info in AsyncDispatcher when event queue size cumulatively reaches > a certain number every time. > -- > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: TestStreamPerf.java, > YARN-8995-branch-3.1.001.patch.addendum, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, > image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9817) Fix failing testcases due to not initialized AsyncDispatcher - ArithmeticException: / by zero
[ https://issues.apache.org/jira/browse/YARN-9817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924659#comment-16924659 ] Tao Yang commented on YARN-9817: Thanks [~Prabhu Joseph] for raising this issue. Patch LGTM, committing now... > Fix failing testcases due to not initialized AsyncDispatcher - > ArithmeticException: / by zero > -- > > Key: YARN-9817 > URL: https://issues.apache.org/jira/browse/YARN-9817 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Affects Versions: 3.3.0, 3.2.1, 3.1.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9817-001.patch > > > Below testcases failing as Asyncdispatcher throws ArithmeticException: / by > zero > {code} > hadoop.mapreduce.v2.app.TestRuntimeEstimators > hadoop.mapreduce.v2.app.job.impl.TestJobImpl > hadoop.mapreduce.v2.app.TestMRApp > {code} > Error Message: > {code} > [ERROR] testUpdatedNodes(org.apache.hadoop.mapreduce.v2.app.TestMRApp) Time > elapsed: 0.847 s <<< ERROR! > java.lang.ArithmeticException: / by zero > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > at > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1015) > at > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:141) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1544) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1263) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:301) > at org.apache.hadoop.mapreduce.v2.app.MRApp.submit(MRApp.java:285) > at > org.apache.hadoop.mapreduce.v2.app.TestMRApp.testUpdatedNodes(TestMRApp.java:223) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} > This happens when AsyncDispatcher is not initialized in the testcases and so > detailsInterval is taken as 0. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
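[Editor's note] As an illustrative sketch only (not the actual AsyncDispatcher code or the YARN-9817 patch): the failure mode above is a modulo against an interval that was never initialized, so guarding the check, or making sure the dispatcher's init path runs in tests, avoids the ArithmeticException.
{code:java}
// Illustrative sketch of the "/ by zero" failure mode and a defensive guard.
public class IntervalGuardSketch {

  private int detailsInterval; // stays 0 if serviceInit/config loading never ran

  void onEventQueued(int queueSize) {
    // Guarding the modulo avoids ArithmeticException when the interval is unset.
    if (detailsInterval > 0 && queueSize % detailsInterval == 0) {
      System.out.println("Event queue size is " + queueSize + ", logging details...");
    }
  }

  public static void main(String[] args) {
    IntervalGuardSketch sketch = new IntervalGuardSketch();
    sketch.onEventQueued(1000); // safe even though detailsInterval was never set
    sketch.detailsInterval = 1000;
    sketch.onEventQueued(2000); // now triggers the detail logging branch
  }
}
{code}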
[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay
[ https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923891#comment-16923891 ] Tao Yang commented on YARN-9795: +1 for the latest patch. I will commit this if no further comments from others. > ClusterMetrics to include AM allocation delay > - > > Key: YARN-9795 > URL: https://issues.apache.org/jira/browse/YARN-9795 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Attachments: YARN-9795.001.patch, YARN-9795.002.patch, > YARN-9795.003.patch, YARN-9795.004.patch > > > Add AM container allocation in QueueMetrics to help diagnose performance > issue. This is following > [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802] > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay
[ https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923882#comment-16923882 ] Tao Yang commented on YARN-9795: Thanks [~fengnanli] for the update. A small suggestion is to remove null initial value for aMContainerAllocationDelay since it seems redundant. Make sense? > ClusterMetrics to include AM allocation delay > - > > Key: YARN-9795 > URL: https://issues.apache.org/jira/browse/YARN-9795 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Attachments: YARN-9795.001.patch, YARN-9795.002.patch, > YARN-9795.003.patch > > > Add AM container allocation in QueueMetrics to help diagnose performance > issue. This is following > [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802] > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9795) ClusterMetrics to include AM allocation delay
[ https://issues.apache.org/jira/browse/YARN-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923024#comment-16923024 ] Tao Yang commented on YARN-9795: Thanks [~fengnanli] for this improvement. The patch almost LGTM. IMO there's no need to set -1 as the initial value of scheduledTime and add the special annotation; 0 should be the proper initial value, as for the other timestamps. The new check-style warnings should be fixed as well. > ClusterMetrics to include AM allocation delay > - > > Key: YARN-9795 > URL: https://issues.apache.org/jira/browse/YARN-9795 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Attachments: YARN-9795.001.patch > > > Add AM container allocation in QueueMetrics to help diagnose performance > issue. This is following > [YARN-2802|https://jira.apache.org/jira/browse/YARN-2802] > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
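[Editor's note] A minimal sketch of the metric under discussion, assuming the delay is simply the gap between the attempt being scheduled and its AM container being allocated (hypothetical class and method names, not the actual ClusterMetrics code); it also keeps 0 as the initial value of the scheduled timestamp, in line with the suggestion above.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Sketch of accumulating AM allocation delay so an average can be reported.
public class AmAllocationDelaySketch {

  private final LongAdder totalDelayMs = new LongAdder();
  private final LongAdder numSamples = new LongAdder();

  // Defaults to 0, matching the suggestion that no -1 sentinel is needed.
  private final AtomicLong scheduledTime = new AtomicLong(0);

  void onAmContainerScheduled(long nowMs) {
    scheduledTime.set(nowMs);
  }

  void onAmContainerAllocated(long nowMs) {
    long scheduled = scheduledTime.get();
    if (scheduled > 0) {
      totalDelayMs.add(nowMs - scheduled);
      numSamples.increment();
    }
  }

  double averageDelayMs() {
    long n = numSamples.sum();
    return n == 0 ? 0.0 : (double) totalDelayMs.sum() / n;
  }

  public static void main(String[] args) throws InterruptedException {
    AmAllocationDelaySketch metrics = new AmAllocationDelaySketch();
    metrics.onAmContainerScheduled(System.currentTimeMillis());
    TimeUnit.MILLISECONDS.sleep(50);
    metrics.onAmContainerAllocated(System.currentTimeMillis());
    System.out.printf("average AM allocation delay: %.1f ms%n", metrics.averageDelayMs());
  }
}
{code}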
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922996#comment-16922996 ] Tao Yang commented on YARN-8995: Hi, [~zhuqi], I found another place that needs to be improved. {{ if (qSize % detailsInterval == 0) }} should be updated to {{ if (qSize != 0 && qSize % detailsInterval == 0 && lastEventDetailsQueueSizeLogged != qSize) }} to avoid printing for an empty queue and printing the same details redundantly. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
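[Editor's note] A self-contained sketch of the guarded check suggested above (a simplified stand-in for the dispatcher, not the actual AsyncDispatcher code): details are printed only when the queue is non-empty, its size hits a multiple of the interval, and that size has not already been reported.
{code:java}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Stand-in dispatcher that logs a per-event-type breakdown when the queue size
// crosses multiples of the interval, using the guarded condition from above.
public class EventQueueDetailLogger {

  private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
  // Running count of dispatched event types; nothing is dequeued in this sketch,
  // so it also reflects what is currently queued.
  private final Map<String, LongAdder> countsByType = new ConcurrentHashMap<>();
  private final int detailsInterval = 1000;
  private volatile int lastEventDetailsQueueSizeLogged = 0;

  void dispatch(String eventType) {
    eventQueue.add(eventType);
    countsByType.computeIfAbsent(eventType, t -> new LongAdder()).increment();

    int qSize = eventQueue.size();
    // Skip empty queues and avoid re-printing the same size repeatedly.
    if (qSize != 0 && qSize % detailsInterval == 0
        && lastEventDetailsQueueSizeLogged != qSize) {
      lastEventDetailsQueueSizeLogged = qSize;
      System.out.println("Event queue size = " + qSize + ", event type counts: " + countsByType);
    }
  }

  public static void main(String[] args) {
    EventQueueDetailLogger logger = new EventQueueDetailLogger();
    for (int i = 0; i < 2500; i++) {
      logger.dispatch(i % 2 == 0 ? "NODE_UPDATE" : "APP_ATTEMPT_ADDED");
    }
  }
}
{code}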
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922279#comment-16922279 ] Tao Yang commented on YARN-8995: Confirmed that the latest patch should not fail like that. The patch now LGTM; waiting for feedback from [~cheersyang], thanks. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921981#comment-16921981 ] Tao Yang commented on YARN-8995: Hi, [~zhuqi]. I noticed that TestAsyncDispatcher#testPrintDispatcherEventDetails, which was added by this patch, failed 2 days ago; can you confirm why this happened? Even though it didn't happen again, I'm still afraid it may fail intermittently. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920568#comment-16920568 ] Tao Yang commented on YARN-8995: Thanks [~zhuqi] for the update. Patch LGTM, could you please also fix the remaining check-style warnings? Hi, [~cheersyang], please help to review again, are these changes ok to you? > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919658#comment-16919658 ] Tao Yang commented on YARN-9540: Thanks [~abmodi], [~adam.antal] for the review and commit. > TestRMAppTransitions fails intermittently > - > > Key: YARN-9540 > URL: https://issues.apache.org/jira/browse/YARN-9540 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Tao Yang >Priority: Minor > Fix For: 3.3.0 > > Attachments: YARN-9540.001.patch > > > Failed > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0] > {code} > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) 
> at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919654#comment-16919654 ] Tao Yang commented on YARN-9798: Thanks [~abmodi] for the review. The frequency is only 1 or 2 failures in 2000 runs, and it didn't happen again after this fix. > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails > intermittently > - > > Key: YARN-9798 > URL: https://issues.apache.org/jira/browse/YARN-9798 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9798.001.patch > > > Found intermittent failure of > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in > YARN-9714 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled UNREGISTERED event but not wait until all events > in dispatcher are handled, we need to add {{rm.drainEvents()}} before that > assertion to fix this issue. > Failure info: > {noformat} > [ERROR] > testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity) > Time elapsed: 0.559 s <<< FAILURE! > java.lang.AssertionError: Expecting only one event expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Standard output: > {noformat} > 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] > resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error > in handling event type REGISTERED for applicationAttempt > appattempt_1567061994047_0001_01 > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276) > at > org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200) > at >
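For context, the fix described in the comment above amounts to draining the dispatcher before the flaky assertion. The following is only a minimal sketch of that idea, assuming a MockRM-based test; the event-count helper and variable names are hypothetical stand-ins, not the actual patch:
{code}
// Minimal sketch of the YARN-9798 idea, assuming a MockRM-based test.
// countUnregisteredEvents(...) and applicationAttemptId are hypothetical
// placeholders for however the test observes dispatched events.

// finishApplicationMaster() dispatches an UNREGISTERED event asynchronously,
// so asserting on the event count immediately afterwards is racy.

rm.drainEvents();  // block until all queued dispatcher events are handled

// Only after the drain is the observed count stable.
Assert.assertEquals("Expecting only one event", 1,
    countUnregisteredEvents(applicationAttemptId));
{code}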
[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9798: --- Attachment: (was: YARN-9798.001.patch) > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails > intermittently > - > > Key: YARN-9798 > URL: https://issues.apache.org/jira/browse/YARN-9798 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9798.001.patch > > > Found intermittent failure of > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in > YARN-9714 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled UNREGISTERED event but not wait until all events > in dispatcher are handled, we need to add {{rm.drainEvents()}} before that > assertion to fix this issue. > Failure info: > {noformat} > [ERROR] > testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity) > Time elapsed: 0.559 s <<< FAILURE! > java.lang.AssertionError: Expecting only one event expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Standard output: > {noformat} > 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] > resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error > in handling event type REGISTERED for applicationAttempt > appattempt_1567061994047_0001_01 > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276) > at > org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401) > at >
[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9798: --- Attachment: YARN-9798.001.patch > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails > intermittently > - > > Key: YARN-9798 > URL: https://issues.apache.org/jira/browse/YARN-9798 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9798.001.patch > > > Found intermittent failure of > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in > YARN-9714 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled UNREGISTERED event but not wait until all events > in dispatcher are handled, we need to add {{rm.drainEvents()}} before that > assertion to fix this issue. > Failure info: > {noformat} > [ERROR] > testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity) > Time elapsed: 0.559 s <<< FAILURE! > java.lang.AssertionError: Expecting only one event expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Standard output: > {noformat} > 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] > resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error > in handling event type REGISTERED for applicationAttempt > appattempt_1567061994047_0001_01 > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276) > at > org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401) > at >
[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919204#comment-16919204 ] Tao Yang commented on YARN-9714: Thanks [~rohithsharma], [~bibinchundatt] for the review and commit! > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
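The cleanup-on-stop pattern described in the issue above looks roughly like the following. This is a simplified sketch only: the field names are placeholders, and the two resources actually live in different services (the timer in AbstractYarnScheduler, the ZooKeeper client in ZKRMStateStore); they are shown together purely for illustration:
{code}
// Simplified sketch of releasing leak-prone resources when the service stops.
// Field names are placeholders, not the exact YARN-9714 patch.
@Override
protected void serviceStop() throws Exception {
  if (releaseCacheCleanupTimer != null) {
    releaseCacheCleanupTimer.cancel();   // stop the release-cache cleanup task
  }
  if (zkClient != null) {
    zkClient.close();                    // drop the ZooKeeper connection
  }
  super.serviceStop();
}
{code}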
[jira] [Resolved] (YARN-9803) NPE while accessing Scheduler UI
[ https://issues.apache.org/jira/browse/YARN-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang resolved YARN-9803. Resolution: Duplicate Hi, [~yifan.stan]. This is a duplicate of YARN-9685, closing it as duplicate. > NPE while accessing Scheduler UI > > > Key: YARN-9803 > URL: https://issues.apache.org/jira/browse/YARN-9803 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Xie YiFan >Assignee: Xie YiFan >Priority: Major > Attachments: YARN-9803-branch-3.1.1.001.patch > > > The same with what described in YARN-4624 > Scenario: > === > if not configure all queue's capacity to nodelabel even the value is 0, start > cluster and access capacityscheduler page. > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:108) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:97) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at > org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:342) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at > org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$LI.__(Hamlet.java:7709) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:513) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86) > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
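Since the trigger described above is a queue with no explicit per-label capacity, a stopgap on affected builds is to configure an explicit capacity for each accessible node label on every queue, even when the value is 0. A hedged example only (the queue path "root.a" and label "gpu" are made up, and this is a workaround sketch, not the patch):
{code}
<!-- Workaround sketch: give every queue an explicit per-label capacity so the
     scheduler page has values to render. Queue path and label are illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.a.accessible-node-labels.gpu.capacity</name>
  <value>0</value>
</property>
{code}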
[jira] [Comment Edited] (YARN-9540) TestRMAppTransitions fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097 ] Tao Yang edited comment on YARN-9540 at 8/30/19 2:00 AM: - Hi, [~adam.antal]. The cause is that the assertion which will make sure dispatcher have handled event but there is no wait before this assertion, we need to add {{rmDispatcher.await()}} like others in TestRMAppTransitions to fix this issue. In my local test, about 5+ failures may happened in 1000 runs. After applying the patch, I didn't see it again. was (Author: tao yang): Hi, [~adam.antal]. The cause is that the assertion which will make sure dispatcher have handled event but not wait, we need to add {{rmDispatcher.await()}} before that assertion like others in TestRMAppTransitions to fix this issue. In my local test, about 5+ failures may happened in 1000 runs. After applying the patch, I didn't see it again. > TestRMAppTransitions fails intermittently > - > > Key: YARN-9540 > URL: https://issues.apache.org/jira/browse/YARN-9540 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9540.001.patch > > > Failed > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0] > {code} > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at >
[jira] [Commented] (YARN-9540) TestRMAppTransitions fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919097#comment-16919097 ] Tao Yang commented on YARN-9540: Hi, [~adam.antal]. The cause is that the assertion which will make sure dispatcher have handled event but not wait, we need to add {{rmDispatcher.await()}} before that assertion like others in TestRMAppTransitions to fix this issue. In my local test, about 5+ failures may happened in 1000 runs. After applying the patch, I didn't see it again. > TestRMAppTransitions fails intermittently > - > > Key: YARN-9540 > URL: https://issues.apache.org/jira/browse/YARN-9540 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9540.001.patch > > > Failed > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0] > {code} > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} -- This message was sent by Atlassian Jira
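The shape of the YARN-9540 fix discussed in the two comments above is roughly the following. It is a simplified sketch rather than the exact patch: the event-firing helper is a hypothetical placeholder, while rmDispatcher.await() and verifyAppCompletedEvent are the existing test pieces referred to in the comment (signatures simplified):
{code}
// Simplified sketch of the fix discussed above (not the exact patch).
// sendAppCompletedEvents(...) is a hypothetical placeholder for the code in
// the test that fires the application-finished events.
sendAppCompletedEvents(application);

rmDispatcher.await();   // wait until the dispatcher has handled the queued
                        // events, as other tests in the class already do

verifyAppCompletedEvent(application);   // the previously flaky assertion
{code}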
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918511#comment-16918511 ] Tao Yang commented on YARN-9664: Thanks [~cheersyang] for the review and commit! > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9538) Document scheduler/app activities and REST APIs
[ https://issues.apache.org/jira/browse/YARN-9538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918510#comment-16918510 ] Tao Yang commented on YARN-9538: Thanks [~cheersyang] for reminding me, I will do that later. > Document scheduler/app activities and REST APIs > --- > > Key: YARN-9538 > URL: https://issues.apache.org/jira/browse/YARN-9538 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9538.001.patch > > > Add documentation for scheduler/app activities in CapacityScheduler.md and > ResourceManagerRest.md. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9540) TestRMAppTransitions fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9540: --- Attachment: YARN-9540.001.patch > TestRMAppTransitions fails intermittently > - > > Key: YARN-9540 > URL: https://issues.apache.org/jira/browse/YARN-9540 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: YARN-9540.001.patch > > > Failed > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0] > {code} > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9540) TestRMAppTransitions fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang reassigned YARN-9540: -- Assignee: Tao Yang (was: Prabhu Joseph) > TestRMAppTransitions fails intermittently > - > > Key: YARN-9540 > URL: https://issues.apache.org/jira/browse/YARN-9540 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, test >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9540.001.patch > > > Failed > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished[0] > {code} > Error Message > expected:<1> but was:<0> > Stacktrace > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppCompletedEvent(TestRMAppTransitions.java:1307) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.verifyAppAfterFinishEvent(TestRMAppTransitions.java:1302) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testCreateAppFinished(TestRMAppTransitions.java:648) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions.testAppFinishedFinished(TestRMAppTransitions.java:1083) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918506#comment-16918506 ] Tao Yang commented on YARN-9799: Thanks [~Prabhu Joseph] for reminding me, I'll fix this issue over there. > TestRMAppTransitions#testAppFinishedFinished fails intermittently > - > > Key: YARN-9799 > URL: https://issues.apache.org/jira/browse/YARN-9799 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9799.001.patch > > > Found intermittent failure of TestRMAppTransitions#testAppFinishedFinished in > YARN-9664 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled APP_COMPLETED event but not wait, we need to add > {{rmDispatcher.await()}} before that assertion like others in this class to > fix this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918469#comment-16918469 ] Tao Yang commented on YARN-9664: Hi, [~cheersyang]. {quote} UT seems not related to this patch, Tao Yang, could you please confirm? {quote} Yes, it's not related to this patch, I have created YARN-9799 to fix it. Thanks. > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9799: --- Attachment: YARN-9799.001.patch > TestRMAppTransitions#testAppFinishedFinished fails intermittently > - > > Key: YARN-9799 > URL: https://issues.apache.org/jira/browse/YARN-9799 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9799.001.patch > > > Found intermittent failure of TestRMAppTransitions#testAppFinishedFinished in > YARN-9664 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled APP_COMPLETED event but not wait, we need to add > {{rmDispatcher.await()}} before that assertion like others in this class to > fix this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9799) TestRMAppTransitions#testAppFinishedFinished fails intermittently
Tao Yang created YARN-9799: -- Summary: TestRMAppTransitions#testAppFinishedFinished fails intermittently Key: YARN-9799 URL: https://issues.apache.org/jira/browse/YARN-9799 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Tao Yang Assignee: Tao Yang Found intermittent failure of TestRMAppTransitions#testAppFinishedFinished in YARN-9664 jenkins report, the cause is that the assertion which will make sure dispatcher has handled APP_COMPLETED event but not wait, we need to add {{rmDispatcher.await()}} before that assertion like others in this class to fix this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918446#comment-16918446 ] Tao Yang commented on YARN-9714: There is an intermittent UT failure in the latest jenkins report, I have created YARN-9798 to fix it. > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9798: --- Attachment: YARN-9798.001.patch > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails > intermittently > - > > Key: YARN-9798 > URL: https://issues.apache.org/jira/browse/YARN-9798 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-9798.001.patch > > > Found intermittent failure of > ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in > YARN-9714 jenkins report, the cause is that the assertion which will make > sure dispatcher has handled UNREGISTERED event but not wait until all events > in dispatcher are handled, we need to add {{rm.drainEvents()}} before that > assertion to fix this issue. > Failure info: > {noformat} > [ERROR] > testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity) > Time elapsed: 0.559 s <<< FAILURE! > java.lang.AssertionError: Expecting only one event expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Standard output: > {noformat} > 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] > resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error > in handling event type REGISTERED for applicationAttempt > appattempt_1567061994047_0001_01 > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276) > at > org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401) > at >
[jira] [Created] (YARN-9798) ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently
Tao Yang created YARN-9798: -- Summary: ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster fails intermittently Key: YARN-9798 URL: https://issues.apache.org/jira/browse/YARN-9798 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Tao Yang Assignee: Tao Yang Found intermittent failure of ApplicationMasterServiceTestBase#testRepeatedFinishApplicationMaster in YARN-9714 jenkins report, the cause is that the assertion which will make sure dispatcher has handled UNREGISTERED event but not wait until all events in dispatcher are handled, we need to add {{rm.drainEvents()}} before that assertion to fix this issue. Failure info: {noformat} [ERROR] testRepeatedFinishApplicationMaster(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterServiceCapacity) Time elapsed: 0.559 s <<< FAILURE! java.lang.AssertionError: Expecting only one event expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:834) at org.junit.Assert.assertEquals(Assert.java:645) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase.testRepeatedFinishApplicationMaster(ApplicationMasterServiceTestBase.java:385) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {noformat} Standard output: {noformat} 2019-08-29 06:59:54,458 ERROR [AsyncDispatcher event handler] resourcemanager.ResourceManager (ResourceManager.java:handle(1088)) - Error in handling event type REGISTERED for applicationAttempt appattempt_1567061994047_0001_01 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:276) at org.apache.hadoop.yarn.event.DrainDispatcher$2.handle(DrainDispatcher.java:91) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1679) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1658) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:914) at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1086) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1067) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:200) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterServiceTestBase$CountingDispatcher.dispatch(ApplicationMasterServiceTestBase.java:401) at org.apache.hadoop.yarn.event.DrainDispatcher$1.run(DrainDispatcher.java:76) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) at
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918321#comment-16918321 ] Tao Yang commented on YARN-9664: Thanks [~cheersyang] for the advice. Attached v3 patch. > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9664: --- Attachment: YARN-9664.003.patch > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch, > YARN-9664.003.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918273#comment-16918273 ] Tao Yang commented on YARN-9714: Hi, [~rohithsharma]. The UT log is filled with these errors: "java.lang.OutOfMemoryError: unable to create new native thread", perhaps threads were exhausted at that time on one of the jenkins nodes. Could you please tell me how to retrigger jenkins without updating the patch or status? > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918247#comment-16918247 ] Tao Yang commented on YARN-9664: Thanks [~cheersyang] for the review. {quote} ActivitiesUtils Line 56: I noticed that the 1st filter is to filter out null objects {quote} The aim of that filter is to select node-level activities rather than to filter out null objects; we use {{e.getNodeId() != null}} since only node-level activities have non-null nodeIds. {quote} what does "single placement node" mean here? {quote} "single placement node" means the scheduling process is based on a single node; I want to use it to distinguish this from multi-node placement scenarios, however the wording may not be ideal, and I would be glad to adopt a better description if you have one. {quote} "Node skipped because of no off-switch and locality violation" I am also not quite sure what does this mean, can you please elaborate? {quote} It means the request has only node-local or rack-local resource requests but no off-switch request, and node/rack locality can't be satisfied. {quote} line 650: is it safe to the check: "if (node != null && !isReserved)" here? {quote} I think there is no need to add that check. No matter whether the node is null and whatever type the assignment is, the required activities should be finished when reaching here. The other points are fine for me; I will update the patch after all the points above are confirmed. Thanks. > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
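As a rough illustration of the filter discussed above, the sketch below keeps only activity entries that carry a non-null nodeId. The types here are simplified stand-ins, not the real ActivitiesUtils/ActivityNode classes:
{noformat}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified sketch (assumed types, not the real YARN classes) of selecting
// node-level activities by checking for a non-null nodeId.
public class NodeLevelActivityFilterExample {
  static class Activity {
    private final String nodeId; // null for non-node-level activities
    Activity(String nodeId) { this.nodeId = nodeId; }
    String getNodeId() { return nodeId; }
  }

  static List<Activity> nodeLevelOnly(List<Activity> activities) {
    return activities.stream()
        .filter(e -> e.getNodeId() != null) // only node-level activities carry a nodeId
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Activity> all = Arrays.asList(new Activity("host-1:45454"), new Activity(null));
    System.out.println(nodeLevelOnly(all).size()); // prints 1
  }
}
{noformat}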
[jira] [Commented] (YARN-9664) Improve response of scheduler/app activities for better understanding
[ https://issues.apache.org/jira/browse/YARN-9664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917516#comment-16917516 ] Tao Yang commented on YARN-9664: Hi, [~cheersyang], it does change a lot, and most of the changes are state/info improvements. I think most of the output from these changes is as expected, but some parts may still need to be improved, so please feel free to give your advice, thanks. > Improve response of scheduler/app activities for better understanding > - > > Key: YARN-9664 > URL: https://issues.apache.org/jira/browse/YARN-9664 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9664.001.patch, YARN-9664.002.patch > > > Currently some diagnostics are not easy enough to understand for common > users, and I found some places still need to be improved such as no partition > information and lacking of necessary activities. This issue is to improve > these shortcomings. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917507#comment-16917507 ] Tao Yang commented on YARN-8995: Hi, [~zhuqi]. The latest patch no longer seems to apply to trunk, could you please rebase and update it? The latest patch also has two places that need to be updated or confirmed: 1. The prefix of YARN_DISPATCHER_PRINT_EVENTS_INFO_THRESHOLD is "yarn.yarn." 2. Why is this update needed: LOG.fatal("Error in dispatcher thread", t) --> LOG.error(FATAL, "Error in dispatcher thread", t) ? > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.2.0 > > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch > > > In our growing cluster, there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event types when the event queue size is too big, add the information > to the metrics, and make the queue size threshold a parameter which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
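As a hedged sketch of the idea behind this improvement (not the actual YARN-8995 patch), the snippet below logs a breakdown of pending event types once the dispatcher queue grows past a configurable threshold; the class name and threshold handling are assumptions made for the example:
{noformat}
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: count pending events by type when the queue exceeds a
// configurable threshold, so the event type that is backing up can be seen
// in the logs (and, in the real patch, exposed as metrics).
public class EventQueueMonitorExample {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private final int printThreshold;

  EventQueueMonitorExample(int printThreshold) {
    this.printThreshold = printThreshold;
  }

  void offer(Object event) {
    eventQueue.offer(event);
    if (eventQueue.size() > printThreshold) {
      Map<String, LongAdder> countsByType = new ConcurrentHashMap<>();
      for (Object e : eventQueue) {
        countsByType.computeIfAbsent(e.getClass().getSimpleName(), k -> new LongAdder())
            .increment();
      }
      System.err.println("Event queue size " + eventQueue.size()
          + " exceeds threshold " + printThreshold
          + ", pending event types: " + countsByType);
    }
  }

  public static void main(String[] args) {
    EventQueueMonitorExample monitor = new EventQueueMonitorExample(2);
    monitor.offer("app-event");
    monitor.offer("node-event");
    monitor.offer("node-event"); // third offer pushes the size past the threshold
  }
}
{noformat}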
[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9714: --- Attachment: YARN-9714.005.patch > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917482#comment-16917482 ] Tao Yang commented on YARN-9714: {quote} Instead of comparing, how about checking for resourceManager.getZKManager() == null? This basically sync the code where zkManager initialization to closing it. {quote} Make sense to me. Attached v5 patch for this, thanks! > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch, YARN-9714.005.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
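To illustrate the ownership check mentioned above, here is a minimal sketch with simplified, assumed types (not the actual ZKRMStateStore code): the state store closes the ZooKeeper connection on stop only when it created that connection itself, i.e. when no RM-wide ZK manager was supplied.
{noformat}
import java.io.Closeable;
import java.io.IOException;

// Minimal sketch: a Closeable stands in for the curator-based ZK manager.
// The store only closes the connection it created; a shared, RM-owned
// connection is left to its owner.
public class OwnedZkConnectionExample {
  private final Closeable rmSharedZkManager; // null => the store owns its own connection
  private Closeable ownZkConnection;

  OwnedZkConnectionExample(Closeable rmSharedZkManager, Closeable ownZkConnection) {
    this.rmSharedZkManager = rmSharedZkManager;
    this.ownZkConnection = ownZkConnection;
  }

  void serviceStop() throws IOException {
    // Mirrors the "resourceManager.getZKManager() == null" check discussed above.
    if (rmSharedZkManager == null && ownZkConnection != null) {
      ownZkConnection.close(); // avoid leaking connections across standby transitions
      ownZkConnection = null;
    }
  }

  public static void main(String[] args) throws IOException {
    Closeable connection = () -> System.out.println("ZooKeeper connection closed");
    new OwnedZkConnectionExample(null, connection).serviceStop(); // prints the message
  }
}
{noformat}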
[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9714: --- Attachment: YARN-9714.004.patch > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch, YARN-9714.004.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource
[ https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916291#comment-16916291 ] Tao Yang commented on YARN-8917: Thanks [~rohithsharma], [~leftnoteasy], [~sunilg] for the review and commit! > Absolute (maximum) capacity of level3+ queues is wrongly calculated for > absolute resource > - > > Key: YARN-8917 > URL: https://issues.apache.org/jira/browse/YARN-8917 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-8917.001.patch, YARN-8917.002.patch > > > Absolute capacity should be calculated by multiplying the queue's capacity by > its parent queue's absolute capacity, > but it is currently calculated by dividing the capacity by the parent queue's > absolute capacity. > The calculation of absolute-maximum-capacity has the same problem. > For example: > root.a capacity=0.4 maximum-capacity=0.8 > root.a.a1 capacity=0.5 maximum-capacity=0.6 > Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25 > Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly > calculated as 0.75 > Moreover: > {{childQueue.getQueueCapacities().getCapacity()}} should be changed to > {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting the wrong > capacity from the default partition when calculating for a non-default partition. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
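To make the corrected calculation easier to follow, here is a small sketch (illustrative only, not the CapacityScheduler code; the method and class names are made up) that reproduces the numbers from the example in the description:
{noformat}
// Illustrative sketch of the corrected formula: a child queue's absolute
// (maximum) capacity is its configured (maximum) capacity multiplied by the
// parent's absolute (maximum) capacity, not divided by it.
public class AbsoluteCapacityExample {
  static double absolute(double childCapacity, double parentAbsoluteCapacity) {
    return childCapacity * parentAbsoluteCapacity; // buggy version divided instead
  }

  public static void main(String[] args) {
    // root.a: capacity=0.4, maximum-capacity=0.8 (root's absolute values are 1.0)
    // root.a.a1: capacity=0.5, maximum-capacity=0.6
    System.out.println(absolute(0.5, 0.4)); // ~0.2, not 1.25 as with division
    System.out.println(absolute(0.6, 0.8)); // ~0.48, not 0.75 as with division
  }
}
{noformat}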
[jira] [Commented] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916290#comment-16916290 ] Tao Yang commented on YARN-9714: TestZKRMStateStore#testZKRootPathAcls UT failure is caused by itself, stateStore (ZKRMStateStore instance) used for verification is not updated after RM HA transition. Will attach v4 patch to fix this UT problem. > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915685#comment-16915685 ] Tao Yang commented on YARN-8193: Hi, [~sunilg], [~leftnoteasy]. Any updates or plans about this fix on branch-2.x? YARN-9779 seems to be the same issue. > YARN RM hangs abruptly (stops allocating resources) when running successive > applications. > - > > Key: YARN-8193 > URL: https://issues.apache.org/jira/browse/YARN-8193 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8193-branch-2-001.patch, > YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, YARN-8193.002.patch > > > When running massive queries successively, at some point RM just hangs and > stops allocating resources. At the point RM get hangs, YARN throw > NullPointerException at RegularContainerAllocator.getLocalityWaitFactor. > There's sufficient space given to yarn.nodemanager.local-dirs (not a node > health issue, RM didn't report any node being unhealthy). There is no fixed > trigger for this (query or operation). > This problem goes away on restarting ResourceManager. No NM restart is > required. > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9779) NPE while allocating a container
[ https://issues.apache.org/jira/browse/YARN-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915684#comment-16915684 ] Tao Yang commented on YARN-9779: Sorry for the late reply. I think this issue is duplicate with YARN-8193. > NPE while allocating a container > > > Key: YARN-9779 > URL: https://issues.apache.org/jira/browse/YARN-9779 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: Amithsha >Priority: Critical > > Getting the following exception while allocating a container > > 2019-08-22 23:59:20,180 FATAL event.EventDispatcher (?:?(?)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:301) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1346) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1341) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > 2019-08-22 23:59:20,180 INFO rmcontainer.RMContainerImpl (?:?(?)) - > container_e2364_1565770624228_198773_01_000946 Container Transitioned from > ALLOCATED to ACQUIRED > 2019-08-22 23:59:20,180 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye.. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-9779) NPE while allocating a container
[ https://issues.apache.org/jira/browse/YARN-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9779: --- Comment: was deleted (was: Sorry for the late reply. I think this issue is duplicate with YARN-8193.) > NPE while allocating a container > > > Key: YARN-9779 > URL: https://issues.apache.org/jira/browse/YARN-9779 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0 >Reporter: Amithsha >Priority: Critical > > Getting the following exception while allocating a container > > 2019-08-22 23:59:20,180 FATAL event.EventDispatcher (?:?(?)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:301) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1346) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1341) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > 2019-08-22 23:59:20,180 INFO rmcontainer.RMContainerImpl (?:?(?)) - > container_e2364_1565770624228_198773_01_000946 Container Transitioned from > ALLOCATED to ACQUIRED > 2019-08-22 23:59:20,180 INFO event.EventDispatcher (?:?(?)) - Exiting, bbye.. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9714) ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9714: --- Attachment: YARN-9714.003.patch > ZooKeeper connection in ZKRMStateStore leaks after RM transitioned to standby > - > > Key: YARN-9714 > URL: https://issues.apache.org/jira/browse/YARN-9714 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: memory-leak > Attachments: YARN-9714.001.patch, YARN-9714.002.patch, > YARN-9714.003.patch > > > Recently RM full GC happened in one of our clusters, after investigating the > dump memory and jstack, I found two places in RM may cause memory leaks after > RM transitioned to standby: > # Release cache cleanup timer in AbstractYarnScheduler never be canceled. > # ZooKeeper connection in ZKRMStateStore never be closed. > To solve those leaks, we should close the connection or cancel the timer when > services are stopping. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org