[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492316#comment-17492316 ]

Jian Chen commented on YARN-11073:
----------------------------------

Thank you for taking a look at this issue [~aajisaka], and for the patching tips. My goal with the current patch is to isolate the fix to where the problem lies, without changing other places (like how a queue's guaranteedCapacity is calculated), to minimize potential side effects. If simpler fixes are safe to apply, that would be great.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11073
>                 URL: https://issues.apache.org/jira/browse/YARN-11073
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, scheduler preemption
>    Affects Versions: 2.10.1
>            Reporter: Jian Chen
>            Priority: Major
>         Attachments: YARN-11073.tmp-1.patch
>
> When running a Hive job in a low-capacity queue on an idle cluster, preemption kicked in to preempt job containers even though there was no other job running and competing for resources.
> Take this scenario as an example:
> * cluster resource:
> ** *queue_low*: min_capacity 1%
> ** queue_mid: min_capacity 19%
> ** queue_high: min_capacity 80%
> * CapacityScheduler with DRF
> During the FIFO preemption candidate selection process, the _preemptableAmountCalculator_ first needs to "_computeIdealAllocation_", which depends on each queue's guaranteed/min capacity. A queue's guaranteed capacity is currently calculated as "Resources.multiply(totalPartitionResource, absCapacity)". Since the Resource object takes only Long values, the Double products get cast into Long, and the guaranteed capacity of *queue_low* comes out as 0 for both memory and vcores.
> Because the guaranteed capacity of *queue_low* is 0, its normalized guaranteed capacity based on active queues is also 0 under the current algorithm in "_resetCapacity_". This eventually leads to the continuous preemption of job containers running in *queue_low*.
> In order to work around this corner case, I made a small patch (for my own use case) around "_resetCapacity_" to consider a few new scenarios:
> * if the sum of absoluteCapacity/minCapacity of all active queues is zero, we should normalize their guaranteed capacity evenly:
> {code:java}
> 1.0f / num_of_queues{code}
> * if the sum of pre-normalized guaranteed capacity values (_MB or VCores_) of all active queues is zero, meaning we might have several queues like queue_low whose capacity value got cast to 0, we should normalize evenly as well, as in the first scenario (if they are all tiny, it really makes no big difference; for example, 1% vs 1.2%).
> * if one of the active queues has a zero pre-normalized guaranteed capacity value but its absoluteCapacity/minCapacity is *not* zero, then we should normalize based on the weight of the queues' configured absoluteCapacity/minCapacity. This makes sure *queue_low* gets a small but fair normalized value when *queue_mid* is also active:
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
> This is how I currently work around this issue; it likely needs someone more familiar with this component to do a systematic review of the entire preemption process to fix it properly.
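To make the three scenarios above concrete, here is a minimal, self-contained sketch of the normalization idea, assuming per-queue arrays of pre-normalized guaranteed values and configured min capacities. The class and method names are hypothetical illustrations, not the actual resetCapacity code in CapacityScheduler:

{code:java}
import java.util.Arrays;

public class NormalizeGuaranteed {
  /**
   * @param guaranteed  pre-normalized guaranteed values (MB or vcores) per active queue
   * @param minCapacity configured absoluteCapacity/minCapacity per active queue
   * @return normalized share per active queue, summing to 1.0
   */
  static float[] normalize(long[] guaranteed, float[] minCapacity) {
    int n = guaranteed.length;
    float[] norm = new float[n];

    float minSum = 0f;
    long resSum = 0L;
    for (int i = 0; i < n; i++) {
      minSum += minCapacity[i];
      resSum += guaranteed[i];
    }

    if (minSum == 0f || resSum == 0L) {
      // Scenarios 1 and 2: nothing usable to weight by, so split evenly.
      Arrays.fill(norm, 1.0f / n);
      return norm;
    }

    boolean castToZero = false;
    for (int i = 0; i < n; i++) {
      if (guaranteed[i] == 0 && minCapacity[i] > 0f) {
        castToZero = true;
        break;
      }
    }

    if (castToZero) {
      // Scenario 3: weight by configured min capacity so a tiny queue
      // still receives a small but non-zero share.
      for (int i = 0; i < n; i++) {
        norm[i] = minCapacity[i] / minSum;
      }
    } else {
      // Default: weight by the pre-normalized guaranteed values.
      for (int i = 0; i < n; i++) {
        norm[i] = (float) guaranteed[i] / resSum;
      }
    }
    return norm;
  }
}
{code}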
> Maybe we can always apply the weight-based approach using absoluteCapacity, or rewrite the code of Resource to remove the casting, or always round up when calculating a queue's guaranteed capacity, etc.
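As a quick illustration of the casting behavior this report describes, the snippet below reproduces the truncation in isolation. It is plain Java with illustrative variable names, not Hadoop's actual Resource class:

{code:java}
public class CapacityTruncationDemo {
  public static void main(String[] args) {
    long clusterVCores = 18;    // total vcores in the partition (illustrative value)
    double absCapacity = 0.01;  // queue_low: min_capacity 1%

    // Resource stores long values, so the Double product is cast down
    // toward zero, and a 1% queue loses its entire guarantee.
    long guaranteedVCores = (long) (clusterVCores * absCapacity);
    System.out.println(guaranteedVCores); // prints 0
  }
}
{code}

Any queue whose share of a resource dimension is below one unit loses its whole guarantee to the cast, which is what starts the preemption loop described above.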
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492307#comment-17492307 ]

Akira Ajisaka commented on YARN-11073:
--------------------------------------

Note that you will need to rename the patch to something like "YARN-11073-branch-2.10-001.patch" to test it against branch-2.10. Creating a GitHub pull request in https://github.com/apache/hadoop is the preferred approach, rather than attaching a patch here. In addition, the target branch must be trunk: all patches go to trunk first and are then backported to the other release branches.
[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
[ https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492305#comment-17492305 ]

Akira Ajisaka commented on YARN-11073:
--------------------------------------

Thank you [~jchenjc22] for your report. I'm +1 on always rounding up when calculating a queue's guaranteed capacity, because it is a very simple fix. Hi [~wangda], do you have any suggestions?
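For reference, the "always roundUp" fix amounts to rounding the product up to the next whole unit instead of truncating it. A minimal sketch of the arithmetic, using a hypothetical helper name (this is not the actual Resources API):

{code:java}
public class RoundUpDemo {
  // Hypothetical round-up multiply; illustrative only, not Hadoop's
  // actual Resources implementation.
  static long multiplyAndRoundUp(long value, double by) {
    return (long) Math.ceil(value * by);
  }

  public static void main(String[] args) {
    // queue_low's 1% of 18 vcores becomes 1 instead of 0, so the
    // normalization in resetCapacity never sees a spurious zero guarantee.
    System.out.println(multiplyAndRoundUp(18, 0.01)); // prints 1
  }
}
{code}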
[jira] [Updated] (YARN-9551) TestTimelineClientV2Impl.testSyncCall fails intermittently
[ https://issues.apache.org/jira/browse/YARN-9551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated YARN-9551:
---------------------------
    Fix Version/s: 3.3.2

> TestTimelineClientV2Impl.testSyncCall fails intermittently
> ----------------------------------------------------------
>
>                 Key: YARN-9551
>                 URL: https://issues.apache.org/jira/browse/YARN-9551
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2, test
>    Affects Versions: 3.3.0
>            Reporter: Prabhu Joseph
>            Assignee: Andras Gyori
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.3, 3.3.2, 3.1.5
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> TestTimelineClientV2Impl.testSyncCall fails intermittently:
> {code:java}
> Failed
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall
> Failing for the past 1 build (Since #24083 )
> Took 1.5 sec.
> Error Message
> TimelineEntities not published as desired expected:<3> but was:<4>
> Stacktrace
> java.lang.AssertionError: TimelineEntities not published as desired expected:<3> but was:<4>
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotEquals(Assert.java:834)
>     at org.junit.Assert.assertEquals(Assert.java:645)
>     at org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall(TestTimelineClientV2Impl.java:251)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Standard Output
> 2019-05-13 15:33:46,596 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(60)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2019-05-13 15:33:47,763 INFO [main] impl.TestTimelineClientV2Impl (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities Published @ index 0 : 1,
> 2019-05-13 15:33:47,764 INFO [main] impl.TestTimelineClientV2Impl (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities Published @ index 1 : 2,
> 2019-05-13 15:33:47,764 INFO [main] impl.TestTimelineClientV2Impl (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities Published @ index 2 : 3,
> 2019-05-13 15:33:47,764 INFO [main] impl.TestTimelineClientV2Impl (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities Published @ index 3 : 4,
> 2019-05-13 15:33:47,765 INFO [main] imp
[jira] [Updated] (YARN-10991) Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString method
[ https://issues.apache.org/jira/browse/YARN-10991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated YARN-10991:
----------------------------
    Fix Version/s: 3.3.2
                       (was: 3.3.3)

> Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString method
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10991
>                 URL: https://issues.apache.org/jira/browse/YARN-10991
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-shell
>    Affects Versions: 3.3.1
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.2, 3.2.4
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently, if the resourcesStr argument in the parseResourcesString method contains "]" at the end, it is not being ignored.
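A minimal sketch of the described fix: strip an optional surrounding "[" ... "]" pair before splitting the key=value pairs. The class name, method shape, and the "memory-mb=2048,vcores=2" input format are assumptions for illustration, not the actual distributed-shell client code:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class ResourcesStringParser {
  static Map<String, Long> parse(String resourcesStr) {
    String s = resourcesStr.trim();
    // Ignore the grouping brackets, e.g. "[memory-mb=2048,vcores=2]".
    if (s.startsWith("[")) {
      s = s.substring(1);
    }
    if (s.endsWith("]")) {
      s = s.substring(0, s.length() - 1);
    }
    // Split the remaining "key=value" pairs on commas.
    Map<String, Long> resources = new HashMap<>();
    for (String pair : s.split(",")) {
      String[] kv = pair.split("=");
      resources.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
    }
    return resources;
  }

  public static void main(String[] args) {
    // Both forms now parse identically:
    System.out.println(parse("[memory-mb=2048,vcores=2]"));
    System.out.println(parse("memory-mb=2048,vcores=2"));
  }
}
{code}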
[jira] [Updated] (YARN-11007) Correct words in YARN documents
[ https://issues.apache.org/jira/browse/YARN-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated YARN-11007:
----------------------------
    Fix Version/s: 3.3.2
                       (was: 3.3.3)

> Correct words in YARN documents
> -------------------------------
>
>                 Key: YARN-11007
>                 URL: https://issues.apache.org/jira/browse/YARN-11007
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 3.3.1
>            Reporter: guophilipse
>            Assignee: guophilipse
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.2, 3.2.4
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
[jira] [Updated] (YARN-10561) Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp
[ https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani updated YARN-10561:
--------------------------------
    Fix Version/s: 3.3.2
                       (was: 3.3.3)

> Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp
> --------------------------------------------------------------------------------
>
>                 Key: YARN-10561
>                 URL: https://issues.apache.org/jira/browse/YARN-10561
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: webapp
>            Reporter: Akira Ajisaka
>            Assignee: Akira Ajisaka
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.2
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The YARN application catalog webapp is using node.js 8.11.3, and the 8.x line is already EoL.