[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-14 Thread Jian Chen (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492316#comment-17492316
 ] 

Jian Chen commented on YARN-11073:
--

Thank you for taking a look at this issue [~aajisaka], and for the patching 
tips. My goal with the current patch is to isolate the fix to where the problem 
is without changing other places (like how a queue's guaranteedCapacity should 
be calculated) to minimize potential side effects. If it's safe to apply 
simpler fixes, that would be great.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the FIFO preemption candidate selection process, the 
> _preemptableAmountCalculator_ first needs to "{_}computeIdealAllocation{_}", 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these double values get cast to Long, and the final result 
> becomes **
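> For illustration, a minimal sketch of the truncation with made-up numbers (not 
> the actual CapacityScheduler code):
> {code:java}
> // 1% of a hypothetical 96-vcore partition is 0.96 vcores, but storing it
> // in a long-valued Resource truncates it to 0.
> long clusterVcores = 96;                                       // assumed partition size
> double absCapacity = 0.01;                                     // queue_low's min_capacity of 1%
> long guaranteedVcores = (long) (clusterVcores * absCapacity);  // 0.96 -> 0
> System.out.println(guaranteedVcores);                          // prints 0
> {code}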
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity across active queues is also 0 under the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to continuous preemption of the 
> job containers running in {_}*queue_low*{_}. 
> To work around this corner case, I made a small patch (for my own use case) 
> around "{_}resetCapacity{_}" to handle a couple of new scenarios (a rough 
> sketch follows the list): 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly:
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of the pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got cast to 0, we should normalize 
> evenly as well, as in the first scenario (if they are all tiny, it makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of each queue's configured 
> absoluteCapacity/minCapacity. This makes sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active: 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
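> A rough sketch of the three scenarios above (simplified; this is not the 
> attached patch, and the method and parameter names are made up):
> {code:java}
> // Normalize guaranteed capacities across active queues:
> //   guaranteed[i] = pre-normalized guaranteed value (e.g. MB or VCores, may be 0)
> //   minCap[i]     = configured absoluteCapacity/minCapacity of the queue
> static float[] normalize(long[] guaranteed, float[] minCap) {
>   int n = minCap.length;
>   float[] out = new float[n];
>   float minCapSum = 0f;
>   long guaranteedSum = 0L;
>   boolean truncatedQueue = false;
>   for (int i = 0; i < n; i++) {
>     minCapSum += minCap[i];
>     guaranteedSum += guaranteed[i];
>     if (guaranteed[i] == 0L && minCap[i] > 0f) {
>       truncatedQueue = true;   // guaranteed value got cast down to 0
>     }
>   }
>   if (minCapSum == 0f || guaranteedSum == 0L) {
>     // Scenarios 1 and 2: nothing meaningful to weight by, so share evenly.
>     for (int i = 0; i < n; i++) {
>       out[i] = 1.0f / n;
>     }
>   } else if (truncatedQueue) {
>     // Scenario 3: weight by the configured min capacities instead.
>     for (int i = 0; i < n; i++) {
>       out[i] = minCap[i] / minCapSum;
>     }
>   } else {
>     // Otherwise weight by the pre-normalized guaranteed values, roughly
>     // what the existing resetCapacity logic does.
>     for (int i = 0; i < n; i++) {
>       out[i] = (float) guaranteed[i] / guaranteedSum;
>     }
>   }
>   return out;
> }
> {code}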
>  
> This is how I currently work around the issue; it probably needs someone 
> more familiar with this component to do a systematic review of the entire 
> preemption process and fix it properly. Maybe we could always apply the 
> weight-based approach using absoluteCapacity, rewrite the Resource code 
> to remove the casting, always round up when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-14 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492307#comment-17492307
 ] 

Akira Ajisaka commented on YARN-11073:
--

Note that you will need to rename the patch to something like 
"YARN-11073-branch-2.10-001.patch" to test it against branch-2.10. Creating 
a GitHub pull request against https://github.com/apache/hadoop is the preferred 
approach rather than attaching a patch here. In addition, the target branch must 
be trunk: all patches go to trunk first and are then backported to the other 
release branches.




[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-14 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492305#comment-17492305
 ] 

Akira Ajisaka commented on YARN-11073:
--

Thank you [~jchenjc22] for your report. I'm +1 on always rounding up when 
calculating a queue's guaranteed capacity because it is a very simple fix.
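For example, a minimal sketch with made-up numbers (not referring to any 
specific existing Resources method):
{code:java}
// Truncation vs. rounding up: with a 1% queue on a hypothetical 96-vcore
// partition, the cast drops the guarantee to 0 while ceil keeps it at 1.
long clusterVcores = 96;                                         // illustrative only
double absCapacity = 0.01;                                       // queue_low's 1%
long truncated = (long) (clusterVcores * absCapacity);           // 0.96 -> 0
long roundedUp = (long) Math.ceil(clusterVcores * absCapacity);  // 0.96 -> 1
{code}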

Hi [~wangda], do you have any suggestion?




[jira] [Updated] (YARN-9551) TestTimelineClientV2Impl.testSyncCall fails intermittently

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-9551:
---
Fix Version/s: 3.3.2

> TestTimelineClientV2Impl.testSyncCall fails intermittently
> --
>
> Key: YARN-9551
> URL: https://issues.apache.org/jira/browse/YARN-9551
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, test
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Andras Gyori
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2, 3.1.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> TestTimelineClientV2Impl.testSyncCall fails intermittently
> {code:java}
> Failed
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall
> Failing for the past 1 build (Since #24083 )
> Took 1.5 sec.
> Error Message
> TimelineEntities not published as desired expected:<3> but was:<4>
> Stacktrace
> java.lang.AssertionError: TimelineEntities not published as desired 
> expected:<3> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestTimelineClientV2Impl.testSyncCall(TestTimelineClientV2Impl.java:251)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Standard Output
> 2019-05-13 15:33:46,596 WARN  [main] util.NativeCodeLoader 
> (NativeCodeLoader.java:(60)) - Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 2019-05-13 15:33:47,763 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 0 : 1,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 1 : 2,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 2 : 3,
> 2019-05-13 15:33:47,764 INFO  [main] impl.TestTimelineClientV2Impl 
> (TestTimelineClientV2Impl.java:printReceivedEntities(413)) - Entities 
> Published @ index 3 : 4,
> 2019-05-13 15:33:47,765 INFO  [main] imp

[jira] [Updated] (YARN-10991) Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString method

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-10991:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Fix to ignore the grouping "[]" for resourcesStr in parseResourcesString 
> method
> ---
>
> Key: YARN-10991
> URL: https://issues.apache.org/jira/browse/YARN-10991
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Affects Versions: 3.3.1
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently, if the resourcesStr argument of the parseResourcesString method 
> contains "]" at the end, it is not being ignored. 






[jira] [Updated] (YARN-11007) Correct words in YARN documents

2022-02-14 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated YARN-11007:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Correct words in YARN documents
> ---
>
> Key: YARN-11007
> URL: https://issues.apache.org/jira/browse/YARN-11007
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Assignee: guophilipse
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2, 3.2.4
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>







[jira] [Updated] (YARN-10561) Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp

2022-02-14 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated YARN-10561:

Fix Version/s: 3.3.2
   (was: 3.3.3)

> Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog 
> webapp
> 
>
> Key: YARN-10561
> URL: https://issues.apache.org/jira/browse/YARN-10561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> YARN application catalog webapp is using node.js 8.11.3, and the 8.x line is 
> already EoL.


