[jira] [Commented] (YARN-10587) Fix AutoCreateLeafQueueCreation cap related caculation when in absolute mode.

2021-01-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269479#comment-17269479
 ] 

Wangda Tan commented on YARN-10587:
---

+1, thanks [~zhuqi]

> Fix AutoCreateLeafQueueCreation cap related caculation when in absolute mode.
> -
>
> Key: YARN-10587
> URL: https://issues.apache.org/jira/browse/YARN-10587
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10587.001.patch, YARN-10587.002.patch
>
>
> When introduced YARN-10504.
> The logic related to auto created leaf queue changed.
> The test in testAutoCreateLeafQueueCreation failed, we should fix the Error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2021-01-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269469#comment-17269469
 ] 

Wangda Tan commented on YARN-10531:
---

+1, thanks [~zhuqi], will get it in later today.

> Be able to disable user limit factor for CapacityScheduler Leaf Queue
> -
>
> Key: YARN-10531
> URL: https://issues.apache.org/jira/browse/YARN-10531
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10531.001.patch, YARN-10531.002.patch, 
> YARN-10531.003.patch, YARN-10531.004.patch, YARN-10531.005.patch, 
> YARN-10531.006.patch, YARN-10531.007.patch
>
>
> User limit factor is used to define max cap of how much resource can be 
> consumed by single user. 
> Under Auto Queue Creation context, it doesn't make much sense to set user 
> limit factor, because initially every queue will set weight to 1.0, we want 
> user can consume more resource if possible. It is hard to pre-determine how 
> to set up user limit factor. So it makes more sense to add a new value (like 
> -1) to indicate we will disable user limit factor 
> Logic need to be changed is below: 
> (Inside LeafQueue.java)
> {code}
> Resource maxUserLimit = Resources.none();
> if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
>   maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
>   getUserLimitFactor());
> } else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) 
> {
>   maxUserLimit = partitionResource;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10581) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include queue creation type for queues

2021-01-21 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269424#comment-17269424
 ] 

Wangda Tan commented on YARN-10581:
---

+1, patch looks good, thanks [~snemeth]

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include 
> queue creation type for queues
> 
>
> Key: YARN-10581
> URL: https://issues.apache.org/jira/browse/YARN-10581
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10581.001.patch, YARN-10581.002.patch, 
> YARN-10581.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has been also imlemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> The queue type should hold these values: 
>  * Auto-created parent queue: *autoCreatedParent*
>  * Auto-created leaf queue: *autoCreatedLeaf*
>  * Static parent: *staticParent*
>  * Static leaf: *staticLeaf* 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10579) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include weight values for queues

2021-01-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268857#comment-17268857
 ] 

Wangda Tan commented on YARN-10579:
---

Thanks [~snemeth], +1 to the latest patch. 

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include 
> weight values for queues
> --
>
> Key: YARN-10579
> URL: https://issues.apache.org/jira/browse/YARN-10579
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10579.001.patch, YARN-10579.002.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
>  We would like to expose the weight values for all queues with the RM's 
> /scheduler REST endpoint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-01-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268757#comment-17268757
 ] 

Wangda Tan commented on YARN-10352:
---

cc: [~ztang] to do a review.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM 
> Active Nodes will be still having those stopped nodes until NM Liveliness 
> Monitor Expires after configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, 
> Multi Node Placement assigns the containers on those nodes. They need to 
> exclude the nodes which has not heartbeated for configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to 
> Asynchronous Capacity Scheduler Threads. 
> (CapacityScheduler#shouldSkipNodeSchedule)
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled  (yarn.node.recovery.enabled)
> 2. Have only one NM running say worker0
> 3. Stop worker0 and start any other NM say worker1
> 4. Submit a sleep job. The containers will timeout as assigned to stopped NM 
> worker0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2021-01-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268755#comment-17268755
 ] 

Wangda Tan commented on YARN-10531:
---

Thanks [~zhuqi], 

Two minor comments: 

1) ParentQueue.java:  

- We can remove: 
  FIXME: Ideally we should disable user limit factor, see YARN-10531 

2) AbstractCSQueue: 

Nit: Let's breakdown 

{code} 
1542  int maxApplicationsPerUser =
1543  leafQueue.getUsersManager().getUserLimitFactor() != -1
1544  ? Math.min(maxApplications,
1545  (int) (maxApplications
1546  * (leafQueue.getUsersManager().getUserLimit() / 
100.0f)
1547  * leafQueue.getUsersManager().getUserLimitFactor()))
1548  : maxApplications;
{code}

Into multiple statements for better readability. 

Thoughts? [~sunilg], [~shuzirra], [~snemeth], [~pbacsko]

> Be able to disable user limit factor for CapacityScheduler Leaf Queue
> -
>
> Key: YARN-10531
> URL: https://issues.apache.org/jira/browse/YARN-10531
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10531.001.patch, YARN-10531.002.patch, 
> YARN-10531.003.patch, YARN-10531.004.patch, YARN-10531.005.patch
>
>
> User limit factor is used to define max cap of how much resource can be 
> consumed by single user. 
> Under Auto Queue Creation context, it doesn't make much sense to set user 
> limit factor, because initially every queue will set weight to 1.0, we want 
> user can consume more resource if possible. It is hard to pre-determine how 
> to set up user limit factor. So it makes more sense to add a new value (like 
> -1) to indicate we will disable user limit factor 
> Logic need to be changed is below: 
> (Inside LeafQueue.java)
> {code}
> Resource maxUserLimit = Resources.none();
> if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
>   maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
>   getUserLimitFactor());
> } else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) 
> {
>   maxUserLimit = partitionResource;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10587) Fix AutoCreateLeafQueueCreation cap related caculation when in absolute mode.

2021-01-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268743#comment-17268743
 ] 

Wangda Tan commented on YARN-10587:
---

Thanks [~zhuqi],

I think the fix is correct, but I want [~sunilg] also take a look at the patch.

> Fix AutoCreateLeafQueueCreation cap related caculation when in absolute mode.
> -
>
> Key: YARN-10587
> URL: https://issues.apache.org/jira/browse/YARN-10587
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10587.001.patch
>
>
> When introduced YARN-10504.
> The logic related to auto created leaf queue changed.
> The test in testAutoCreateLeafQueueCreation failed, we should fix the Error.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10578) Fix Auto Queue Creation parent handling

2021-01-18 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267508#comment-17267508
 ] 

Wangda Tan commented on YARN-10578:
---

+1 to the latest patch, submitted patch to trigger Jenkins.

> Fix Auto Queue Creation parent handling
> ---
>
> Key: YARN-10578
> URL: https://issues.apache.org/jira/browse/YARN-10578
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10578.001.patch
>
>
> YARN-10506 introduced the new auto queue creation logic, however a parent == 
> null check in CapacityScheduler#autoCreateLeafQueue will prevent a two levels 
> queue to be created. We need to revert it back to the normal logic, also, we 
> should wrap the auto queue handling with a lock.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266391#comment-17266391
 ] 

Wangda Tan commented on YARN-10532:
---

[~zhuqi], thanks for the patch, I took a brief look (nothing in detail yet). I 
think here're my overall thoughts: 

1) Instead of changing GuaranteedOrZeroCapacityOverTimePolicy, I suggest 
creating a new Policy (maybe we can make it runnable by default so we don't 
have to create another config). The policy just simply monitor queue last used 
time and delete queues when needed. 

2) The latest patch only remove queue when reinitialize queue is called, it is 
not a frequent used method, we have cluster which doesn't have queue 
reinitialize called for a long period of time. Can we just call it from the 
policy (then we may need to introduce a method in CapacityScheduler to delete 
queue). 

3) When we delete a queue, we need to check inside Scheduler to make sure 1) 
there's nothing running in the queue; 2) last usage timestamp is expired. We 
need to do this to avoid racing condition that Policy think a queue is 
deletable but Scheduler doesn't. 

4) Policy modifies "expiredQueue" field, I suggest to make Policy just read the 
state, and let Scheduler delete it. 

5) An additional requirement we should keep in mind: 

Scenario A:
{code:java}
- At time T0, policy signals scheduler to delete queue A (an auto created 
queue). 
- Before the signal arrives to scheduler, an app submitted to scheduler (T1). 
T1 > T0
- When at T2 (T2 > T1), the signal arrived at scheduler, scheduler should avoid 
removing the queue A because now it is used.{code}
Scenario B:
{code:java}
- At time T0, policy signals scheduler to delete queue A (an auto created 
queue).
- At T1 (T1 > T0), scheduler got the signal and deleted the queue.
- At T2 (T2 > T1), an app submitted to scheduler.

Scheduler should immediately recreate the queue, in another word, deleting an 
dynamic queue should NEVER fail a submitted application.{code}
 

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10532.001.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266385#comment-17266385
 ] 

Wangda Tan commented on YARN-10506:
---

Verified the test failure is an intermittent one which is not caused by the 
patch, but I still added a minor change to call mockRM.shutdown to cleanup 
mockRM before starting a new one (see ver.17). 

Committed the ver.17 patch to trunk. Thanks for the contribution from [~zhuqi] 
and [~gandras]! And really appreciate reviews from [~shuzirra], [~pbacsko], and 
[~bteke]! 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506-013.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch, YARN-10506.011.patch, 
> YARN-10506.014.patch, YARN-10506.015.patch, YARN-10506.016.patch, 
> YARN-10506.017.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-15 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10506:
--
Attachment: YARN-10506.017.patch

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506-013.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch, YARN-10506.011.patch, 
> YARN-10506.014.patch, YARN-10506.015.patch, YARN-10506.016.patch, 
> YARN-10506.017.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2021-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266294#comment-17266294
 ] 

Wangda Tan commented on YARN-10531:
---

[~zhuqi], I haven't reviewed details of the patch yet, I'm not sure if we need 
rebase after YARN-10506. Please share your thoughts so we can move it forward.

> Be able to disable user limit factor for CapacityScheduler Leaf Queue
> -
>
> Key: YARN-10531
> URL: https://issues.apache.org/jira/browse/YARN-10531
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10531.001.patch, YARN-10531.002.patch
>
>
> User limit factor is used to define max cap of how much resource can be 
> consumed by single user. 
> Under Auto Queue Creation context, it doesn't make much sense to set user 
> limit factor, because initially every queue will set weight to 1.0, we want 
> user can consume more resource if possible. It is hard to pre-determine how 
> to set up user limit factor. So it makes more sense to add a new value (like 
> -1) to indicate we will disable user limit factor 
> Logic need to be changed is below: 
> (Inside LeafQueue.java)
> {code}
> Resource maxUserLimit = Resources.none();
> if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
>   maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
>   getUserLimitFactor());
> } else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) 
> {
>   maxUserLimit = partitionResource;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266292#comment-17266292
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks for the additional comments, I +1 to the latest patch, if no objections 
I will get it in by this afternoon. 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506-013.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch, YARN-10506.011.patch, 
> YARN-10506.014.patch, YARN-10506.015.patch, YARN-10506.016.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265652#comment-17265652
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks [~zhuqi], the latest patch looks good to me!

I will wait for thoughts from others, if no additional opposite opinions I will 
get it in by tomorrow.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506-013.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch, YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265521#comment-17265521
 ] 

Wangda Tan commented on YARN-10506:
---

Sounds good, thanks [~shuzirra], [~pbacsko]. It makes sense to me, [~zhuqi] can 
you add your thoughts about the latest proposal?  

Thanks,

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch, YARN-10506.007.patch, 
> YARN-10506.009.patch, YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265149#comment-17265149
 ] 

Wangda Tan commented on YARN-10506:
---

[~pbacsko], I see, I think it makes sense.
{quote}The logic must be there, otherwise we can't proceed to the next rule if 
the fallback is set to "skip". If you perform the validations (which are 
currently done by {{CSMappingPlacementRule}}) outside of the placement engine, 
you lose the ability to execute different actions (skip/reject/place to 
default). 
{quote}
Because it is different from the previous CS queue mapping: previously, when a 
rule matches, it up to the scheduler to make a decision such as create queue, 
accept or reject the app. 

Now we need to make sure rule engine makes the right decision before sending 
the request to scheduler. We need to be careful about possible racing 
conditions. But I think you're right, when admin change 
.auto-queue-creation-v2.enabled in the queue hierarchy, it is possible 
to cause undesirable result. But syncing between the two systems (rule engine 
and scheduler) will have delays in any case, so removing create flag from 
ApplicationPlacementContext sounds not bad. 

I will let [~zhuqi], [~gandras] and you guys to decide what is the best 
approach to take, I'm OK with either approaches. But let's reach a conclusion 
fast since this patch blocks a number of follow up tickets.

 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch, YARN-10506.007.patch, 
> YARN-10506.009.patch, YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-14 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265085#comment-17265085
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks [~shuzirra], [~pbacsko], posting this comment while in the middle of 
something else, I hope I didn't miss important information. 

Here you shared several things: 

 

*1) Add a create flag on the rule itself in the JSON:* 

I agreed. 

*2) yarh.scheduler.capacity..auto-queue-creation-v2.enabled*

I agreed. 

Also, I agree we should look at if we can make the property inheritable or not, 
I agree but I suggest to move the implementation to a separate Jira.

*3) Removing "create" from the ApplicationPlacementContext:* 

I initially thought it is needed because the rule engine may not do exhaust 
checks and the concurrency issue. _(It is possible when the rule checks, the 
queue allow creation, but when ApplicationPlacementContext arrived at 
scheduler, the queue is refreshed)_. So I think it is still valuable to let the 
scheduler do the creation based on demand, and fail the App submission 
atomically. 

Also, if we relies on Scheduler to check creatable or not, it can reduce 
complexities of RuleEngine to do additional check. RuleEngine can just pass 
ApplicationPLacementContext to Scheduler and let scheduler making the decision.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch, YARN-10506.007.patch, 
> YARN-10506.009.patch, YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-13 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264661#comment-17264661
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks [~zhuqi], I don't have further comments, +1.  [~gandras] can you share 
your thoughts on the latest patch?

If no further objections, I plan to get the patch committed by tomorrow.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506-012.patch, YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch, YARN-10506.007.patch, 
> YARN-10506.009.patch, YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-13 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264397#comment-17264397
 ] 

Wangda Tan commented on YARN-10506:
---

Reg. *2) How we deal with the queue's auto-queue-creation configuration flag?*

Can we rename the property to {{queue-path.auto-queue-creation-v2.enabled}} ? 
I'm looking for an approach to more distinguished from the older one.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506.001.patch, YARN-10506.002.patch, YARN-10506.003.patch, 
> YARN-10506.004.patch, YARN-10506.005.patch, YARN-10506.006-combined.patch, 
> YARN-10506.006.patch, YARN-10506.007.patch, YARN-10506.009.patch, 
> YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-13 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264390#comment-17264390
 ] 

Wangda Tan commented on YARN-10506:
---

[~zhuqi], [~gandras], 

I just took a look at the latest patch, here's my comment: 

I think we still need to make a conclusion for the following items: 

*1) How we deal with "create" flag of ApplicationPlacementContext?* 

Based on latest patch, we have two flags added to ApplicationPlacementContext. 
But we only do one 
{code:java}
 if (apc.isCreateLeafQueue()
|| apc.isCreateParentQueue()) {
...
LeafQueue lq =
autoQueueHandler.autoCreateQueue(apc);
} {code}
And we hardcoded the two values: 
{code:java}
apc.setCreateParentQueue(true);
apc.setCreateLeafQueue(true); {code}
To me, It is not sufficient, we need to check inside the handler:
{code:java}
if (apc.isCreateParentQueue()) {
  createParentQueue()
}
if (apc.isCreatedLeafQueue()) {
  createLeafQueue()
}
 {code}
We should add tests for that because it is contract for future integration, we 
should have the following test cases: 
{code:java}
 1) when createLeaf = false, createParent = false: 
1.1 When both Leaf doesn't exist or Parent doesn't exist: Application will 
be rejected.
1.2 When Parent exists but Leaf doesnt't exist: Application will be 
rejected. 
1.3 When both exists, application will be accepted

2) Other combinations ..{code}
If we can abstract common test functionality, we should be able to do the 
testing without too much-duplicated code.

Can we do it with this patch? *I don't want to delay this (to a separate Jira) 
because once another feature integration happens (such as from Queue placement 
policy), we will face issues and will cause further delays.*

*2) How we deal with the queue's auto-queue-creation configuration flag?*

I think we can create a flag for c-s.xml to enable auto create queue for each 
parent now, but I felt we need to change it later. As far as we get 
functionality correct, I'm OK with pushing this to a follow-up patch.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506.001.patch, YARN-10506.002.patch, YARN-10506.003.patch, 
> YARN-10506.004.patch, YARN-10506.005.patch, YARN-10506.006-combined.patch, 
> YARN-10506.006.patch, YARN-10506.007.patch, YARN-10506.009.patch, 
> YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-13 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264238#comment-17264238
 ] 

Wangda Tan commented on YARN-10506:
---

Can we also take care of javac warnings?

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506.001.patch, YARN-10506.002.patch, YARN-10506.003.patch, 
> YARN-10506.004.patch, YARN-10506.005.patch, YARN-10506.006-combined.patch, 
> YARN-10506.006.patch, YARN-10506.007.patch, YARN-10506.009.patch, 
> YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-13 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264237#comment-17264237
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks [~zhuqi] and [~gandras] for quick updates! 

*I will review the latest patch  (in detail) during my day time*, and to the 
question: 
{quote} However, I think this method should not take into account which mode 
the parent is in, this should be handled outside of this. I think it is safe to 
assume, that for empty queues just return WEIGHT, because it is more 
restrictive, than PERCENTAGE mode. Lets wait for the opinion of [~wangda] about 
it as well
{quote}
I would agree with the statement. We can revisit how we can do better to 
distinguish WEIGHT, PERCENTAGE, ABS configuration, which needs additional 
cleanup and refactoring. So far, I think it is good for this patch. 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506-010.patch, 
> YARN-10506.001.patch, YARN-10506.002.patch, YARN-10506.003.patch, 
> YARN-10506.004.patch, YARN-10506.005.patch, YARN-10506.006-combined.patch, 
> YARN-10506.006.patch, YARN-10506.007.patch, YARN-10506.009.patch, 
> YARN-10506.011.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263672#comment-17263672
 ] 

Wangda Tan commented on YARN-10506:
---

Thanks for all updates, here's my comments: 

 

1) CapacityScheduler:

Minor:
autoCreateLeafQueue should be moved to autoQueueHandler. 
We can do this in a follow up patch, because now we have two places to handle 
the auto queue creation, and ideally "autoQueueHandler" should be responsible 
for that. 
The same follow up patch should also clean up addQueue() method of 
ResourceScheduler. It is only used by CapacitySchedulerPlanFollow, we don't 
need to add it to the abstract class.

{code}
LeafQueue lq = autoQueueHandler.autoCreateQueuePath(placementContext);
{code} 
Return value is not used.


2) CapacitySchedulerAutoQueueHandler:

Major:

2.1) I think the existing logic is to create auto queue based on 
ApplicationPlacementContext, but when we do queue mapping, we need to indicate 
if auto creation is allowed or not.
In the mapping rule, we have "create" flag, I think CSAutoQueueHandler should 
consider the "create" flag.

Previously, in my patch, I added createLeafQueue and createParentQueue flag, 
the latest patch removed the logic.

Can you share more thoughts on this? Any please let me know if I missed 
anything.

Even though this logic can be done in a separate patch, but I still suggest to 
handle it within this one for completeness.

(A side note, I noticed there's a method: 
ParentQueue#isEligibleForAutoQueueCreation, it only handles "if a parent queue 
can allow auto creation underneath or not", but cannot handle "if a creation 
itself is allowed by placement policy or not").

2.2) autoCreateQueuePath doesn't look like atomic, it could create parent 
first, but later found LeafQueue cannot be created:

{code}
if (parent instanceof ParentQueue) {
...
} else {
throw new SchedulerDynamicEditException(
"Could not auto-create leaf queue for " + queue.getQueue()
+ ". Queue mapping specifies an invalid parent queue "
+ "which does not exist"
+ queue.getParentQueue());
}
{code}

Can we make it atomic?

2.3) I think autoCreateParentHierarchy itself should be able to handle 
LeafQueue creation, because itself can handle multiple leaves, we don't have to 
maintain a separate LeafQueue creation logic: 
{code}
ParentQueue parentQueue = (ParentQueue) parent;
LeafQueue leafQueue = parentQueue.addDynamicLeafQueue(
queue.getFullQueuePath());
queueManager.addQueue(leafQueue.getQueuePath(), leafQueue);

return leafQueue; 
{code}

Minor:
- Rename autoCreateQueuePath to autoCreateQueue (we will never create a "queue 
path") 
- CSQueueUtils#extractQueuePath should be moved to CSAutoQueueHandler. (It 
won't be used by other classes). And rename extractQueuePath to 
extractApplicationPlacementContext (we don't just extract path)

3) CapacitySchedulerConfiguration change:

Major:
- Now we added queue-path.auto-queue-creation.enabled flag, which is in 
parallel of auto-create-child-queue.enabled flag.

First of all, it is very confusing because there're two parameters looks 
similar.
Second, We still need to check the weight mode is configured or not.
I'm actually thinking to get rid of the flag, and completely relying on 
ParentQueue#addDynamicChildQueue to do the weights check: If a parent queue has 
children use weight, we can proceed with queue creation.
We can improve the flag later, thoughts?

4) ParentQueue#addDynamicChildQueue:

Major: 
- The logic ParentQueue#getCapacityConfigurationTypeForQueues returns PERCENT 
when there's no children under the parent. For that case, I think we should 
specially handle it inside getCapacityConfigurationTypeForQueues: 
{code}
Which should return WEIGHT when Collection queues isEmpty
{code}

And we should add an unit test to add queue under a static parent queue which 
doesn't have children, because it will be a common case when user use the 
feature.

 

 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-01-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263575#comment-17263575
 ] 

Wangda Tan commented on YARN-10564:
---

Thanks [~zhuqi], let's get YARN-10506 done shortly (in a day or two), and we 
can then move on to other patches including this one, auto-delete queue, etc.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10564) Support Auto Queue Creation template configurations

2021-01-12 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-10564:
-

Assignee: zhuqi  (was: Andras Gyori)

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2021-01-12 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-10535:
-

Assignee: Gergely Pollak

> Make changes in queue placement policy to use auto-queue-placement API in 
> CapacityScheduler
> ---
>
> Key: YARN-10535
> URL: https://issues.apache.org/jira/browse/YARN-10535
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Gergely Pollak
>Priority: Major
>
> Once YARN-10506 is done, we need to call the API from the queue placement 
> policy to create queues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-12 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263457#comment-17263457
 ] 

Wangda Tan commented on YARN-10506:
---

[~zhuqi], [~gandras] can you check the findbugs warning?

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506-006-10504-010.patch, 
> YARN-10506-007-10504-010.patch, YARN-10506-008.patch, YARN-10506.001.patch, 
> YARN-10506.002.patch, YARN-10506.003.patch, YARN-10506.004.patch, 
> YARN-10506.005.patch, YARN-10506.006-combined.patch, YARN-10506.006.patch, 
> YARN-10506.007.patch, YARN-10506.009.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263018#comment-17263018
 ] 

Wangda Tan edited comment on YARN-10504 at 1/12/21, 1:58 AM:
-

Committed ver.010 to trunk, thanks to everybody who contributed code ([~bteke], 
[~zhuqi], [~gandras]) and reviewed the patch ([~sunilg], [~epayne])!


was (Author: wangda):
Committed to trunk, thanks to everybody who contributed code ([~bteke], 
[~zhuqi], [~gandras]) and reviewed the patch ([~sunilg], [~epayne])!

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Fix Version/s: 3.4.0

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263018#comment-17263018
 ] 

Wangda Tan commented on YARN-10504:
---

Committed to trunk, thanks to everybody who contributed code ([~bteke], 
[~zhuqi], [~gandras]) and reviewed the patch ([~sunilg], [~epayne])!

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262985#comment-17262985
 ] 

Wangda Tan commented on YARN-10497:
---

[~shuzirra], [~pbacsko] can you review the latest patch when you get a chance?

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262853#comment-17262853
 ] 

Wangda Tan commented on YARN-10504:
---

It looks like folks are generally OK with getting this patch in and deal with 
further clean up for mixed config mode in a follow up Jira. +_I plan to get it 
in by today my time. 

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262853#comment-17262853
 ] 

Wangda Tan edited comment on YARN-10504 at 1/11/21, 6:35 PM:
-

It looks like folks are generally OK with getting this patch in and deal with 
further clean up for mixed config mode in a follow-up Jira, it looks LGTM. I 
plan to get it in by today my time. 


was (Author: wangda):
It looks like folks are generally OK with getting this patch in and deal with 
further clean up for mixed config mode in a follow up Jira. +_I plan to get it 
in by today my time. 

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262813#comment-17262813
 ] 

Wangda Tan commented on YARN-10559:
---

Removed fix version of the Jira.

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, If a single user submits 
> a large application to a queue (using 100% of resources), that job will not 
> be preempted by future applications from the same user within the same queue. 
> This implies that the later applications will be forced to wait for 
> completion of the long running application. This prevents multiple long 
> running, large, applications from running concurrently.
> Support fair sharing among apps while preempting applications from same queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262813#comment-17262813
 ] 

Wangda Tan edited comment on YARN-10559 at 1/11/21, 5:21 PM:
-

Removed fix version of the Jira. (We should only set it when patch got 
committed).


was (Author: wangda):
Removed fix version of the Jira.

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, If a single user submits 
> a large application to a queue (using 100% of resources), that job will not 
> be preempted by future applications from the same user within the same queue. 
> This implies that the later applications will be forced to wait for 
> completion of the long running application. This prevents multiple long 
> running, large, applications from running concurrently.
> Support fair sharing among apps while preempting applications from same queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10559) Fair sharing intra-queue preemption support in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10559:
--
Fix Version/s: (was: 3.1.4)

> Fair sharing intra-queue preemption support in Capacity Scheduler
> -
>
> Key: YARN-10559
> URL: https://issues.apache.org/jira/browse/YARN-10559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 3.1.4
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: FairOP_preemption-design_doc_v1.pdf, 
> FairOP_preemption-design_doc_v2.pdf, YARN-10559.0001.patch, 
> YARN-10559.0002.patch, YARN-10559.0003.patch, YARN-10559.0004.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Usecase:
> Due to the way Capacity Scheduler preemption works, If a single user submits 
> a large application to a queue (using 100% of resources), that job will not 
> be preempted by future applications from the same user within the same queue. 
> This implies that the later applications will be forced to wait for 
> completion of the long running application. This prevents multiple long 
> running, large, applications from running concurrently.
> Support fair sharing among apps while preempting applications from same queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262751#comment-17262751
 ] 

Wangda Tan commented on YARN-10504:
---

Thanks [~zhuqi], [~bteke] and review from [~sunilg].  

I suggest let's go ahead with ver.010, and deal with the abs resource in a 
follow-up Jira. This Jira is already very large and we need more clean up for 
the absolute resource + weight + percent rather than just fixing the problem 
itself. Also, this Jira blocked a number of efforts such as YARN-10506. 

Thoughts?

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.011.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262394#comment-17262394
 ] 

Wangda Tan commented on YARN-10504:
---

Attached ver.010, it should be able to fix all test failures.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: (was: YARN-10504.010.patch)

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.010.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.010.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.010.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262323#comment-17262323
 ] 

Wangda Tan commented on YARN-10506:
---

Attached ver.6 combined patch (which includes YARN-10506.005 patch, plus 
YARN-10504.009 patch), which will trigger Jenkins and see how the run looks 
like.

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10506:
--
Attachment: YARN-10506.006-combined.patch

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006-combined.patch, YARN-10506.006.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262321#comment-17262321
 ] 

Wangda Tan commented on YARN-10504:
---

Uploaded ver.009, fixed more test failures. I run it locally and I think most 
test cases should pass. Let's see how Jenkins's result looks like.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, 
> YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.009.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.009.patch, YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, 
> YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262302#comment-17262302
 ] 

Wangda Tan commented on YARN-10506:
---

Manually rebased patch on top of the latest YARN-10504 (008). Haven't run any 
tests, just wanted to see if there are any significant conflicts (so far no). 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10506:
--
Attachment: YARN-10506.006.patch

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch, YARN-10506.002.patch, 
> YARN-10506.003.patch, YARN-10506.004.patch, YARN-10506.005.patch, 
> YARN-10506.006.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262301#comment-17262301
 ] 

Wangda Tan commented on YARN-10504:
---

Still see some issues around auto created leaf queue, fixed a number of issues 
introduced by the latest changes in ParentQueue.setChildQueues. 

Unit tests: 
- TestRMWebServicesForCSWithPartitions: 
 [FIXED] There's bad assumption for #resourceByPartition returned. When the 
queue is initialized, there's no usage thus there should have no 
resourceByPartition. Updated tests to not test #resourceByPartition.

- TestRMWebServices: 
 [FIXED] testValidateAndGetSchedulerConfigurationInvalidConfig: This is a bad 
test which checks returned plain text of exception, we should avoid do that. 
 fix: Changed to just check it is an IOException instead of checking the actual 
text.
 * TestAbsoluteResourceConfiguration
 [PENDING] testSimpleMinMaxResourceConfigurartionPerQueue: 
 this is still pending, because now AutoCreatedLeafQueue depends on 
mergeCapacities, which calculate capacity first, then use the capacity to 
calculate absolute min resource. 
 It can cause a change in effectiveMinResource value even if cluster has 
sufficient resource. 
 We should look into this once we finish the YARN-10506. Now temporarily update 
test to use the wrong value to unblock tests.
 fixed rest issues.

Will continue investigate.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.008.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.008.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262257#comment-17262257
 ] 

Wangda Tan commented on YARN-10504:
---

Thanks [~bteke], I'm checking the failures now.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.007.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-09 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.006.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.006.patch, YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, 
> YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-09 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261943#comment-17261943
 ] 

Wangda Tan commented on YARN-10504:
---

Updated ver.6 patch, which includes the following: 

1) Unit tests to test end to end capability and mixed percentage/weight mode. 

2) Rewrite ParentQueue.setChildQueues. It was a bit messy before. Now it 
enforced more strict checks, and rewrote check statements for better 
readability.

[~zhuqi]/[~bteke] /[~gandras] , I haven't addressed your comment, so it will be 
nice if you can help to make the changes. (And again, please add a comment if 
you plan to do that to avoid editing the same code).

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-09 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261921#comment-17261921
 ] 

Wangda Tan commented on YARN-10504:
---

Make sense [~zhuqi].

[~bteke]/[~gandras] , if you're not making changes to the patch, [~zhuqi] can 
you take care of the issue? 

I plan to add a few more test case coverages for the weight mode 
today/tomorrow. I will not touch any existing logic. 

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-08 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261774#comment-17261774
 ] 

Wangda Tan commented on YARN-10504:
---

1) AbstractCSQueue, lots of format changes, can we revert all format change 
(change spaces). Since it will cause issues when backport and reviews 

2) LeafQueue: 
There're two TODOs: 
  //TODO recalculate max applications because they can depend on capacity
  //TODO recalculate max applications because they can depend on capacity

I think it took care by the {{updateAbsoluteCapacitiesAndRelatedFields}}, 
please double check, and if it is correct, we should remove the TODOs

3) TestAbsoluteResourceWithAutoQueue: 

I added a TODO: 
  TODO: Wangda: I think this test case is not correct, Sunil could help look
  // into details.

4) TestCapacitySchedulerAutoCreatedQueueBase.java

I also added a TODO
// TODO: Wangda, I think this is a wrong test, it doesn't consider rounding
...

[~bteke], [~gandras], [~zhuqi] if any of you plan to make change to the patch, 
please add a comment to the JIRA so others will know, otherwise the patch will 
be hard to merge.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.003.patch, YARN-10504.004.patch, YARN-10504.005.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2021-01-06 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260114#comment-17260114
 ] 

Wangda Tan commented on YARN-10504:
---

[~gandras],

Thanks for updating this, your observation is correct: In the original patch I 
moved calculation of absolute percentage from config parsing phase to refresh 
queue phase. Why I did that? 

- Because of the weight mode will impact relative percentage, queue parsing 
phase is no longer able to get all percentages of queues in one shot. Even 
though we can normalize weight values during parsing phase, however, if we have 
dynamic queues under the same parent, parsing phase cannot get all percentage 
correctly calcuated. 

- The previous calculation of config properties are bit ad-hoc: Some values are 
calculated during setupQueueConfigs, some values calculated in 
reinitialization, some values calculated in  updateClusterResource. The patch I 
moved many of the config properties which needs runtime calculation (such as 
absolute percentage) to updateCluserResources. (And removed from 
setupQueueConfig) 

- I noticed failure of AutoCreatedLeafQueue, but I didn't get a chance to dig 
more for the root cause, sjince updateClusterResource will be called for 
AutoCreatedLeafQueue, I don't know why it is still failing. Is it a test issue 
or production issue? (Please note that updateClusterResource of CSQueue will be 
called during CS init, reinit, and resource changes).

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.002.patch, 
> YARN-10504.ver-1.patch, YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2020-12-15 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10535:
-

 Summary: Make changes in queue placement policy to use 
auto-queue-placement API in CapacityScheduler
 Key: YARN-10535
 URL: https://issues.apache.org/jira/browse/YARN-10535
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Wangda Tan


Once YARN-10506 is done, we need to call the API from the queue placement 
policy to create queues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2020-12-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250003#comment-17250003
 ] 

Wangda Tan commented on YARN-10535:
---

cc: [~shuzirra], [~pbacsko]

> Make changes in queue placement policy to use auto-queue-placement API in 
> CapacityScheduler
> ---
>
> Key: YARN-10535
> URL: https://issues.apache.org/jira/browse/YARN-10535
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> Once YARN-10506 is done, we need to call the API from the queue placement 
> policy to create queues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-15 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-10506:
-

Assignee: Andras Gyori

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-12-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249870#comment-17249870
 ] 

Wangda Tan commented on YARN-10295:
---

[~snemeth], just come across the issue, I saw a bunch of failed unit tests in 
the above jenkins output, is it patch related?

> CapacityScheduler NPE can cause apps to get stuck without resources
> ---
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.2.2, 3.1.5
>
> Attachments: YARN-10295.001.branch-3.1.patch, 
> YARN-10295.001.branch-3.2.patch, YARN-10295.002.branch-3.1.patch, 
> YARN-10295.002.branch-3.2.patch
>
>
> When the CapacityScheduler Asynchronous scheduling is enabled and log level 
> is set to DEBUG there is an edge-case where a NullPointerException can cause 
> the scheduler thread to exit and the apps to get stuck without allocated 
> resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:apply(681)) - Reserved 
> container=container_e10_1590502305306_0660_01_000115, on node=host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used= with 
> resource=
> 2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:internalUnreserve(743)) - Application 
> application_1590502305306_0660 unreserved  on node host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used=, currently 
> has 0 at priority 11; currentReservation  on node-label=
> 2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
> Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough 
> memory, so after a short while it gets unreserved. However because the 
> scheduler thread is running asynchronously it might have entered into the 
> following if block located in 
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
>  because at the time _node.getReservedContainer()_ wasn't null. Calling it a 
> second time for getting the ApplicationAttemptId would be an NPE, as the 
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
> if (LOG.isDebugEnabled()) {
> LOG.debug("Skipping scheduling since node " + node.getNodeID()
> + " is reserved by application " + node.getReservedContainer()
> .getContainerId().getApplicationAttemptId());
>  }
>  return null;
> }
> {code}
> A fix would be to store the container object before the if block. 
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664 
> which indirectly fixed this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248261#comment-17248261
 ] 

Wangda Tan commented on YARN-10506:
---

Uploaded a patch, which is based on YARN-10504. It has done the following: 

- Handle create of leaf and parent. 
- Added control parameter to ApplicationPlacementContext. 
- Handle dynamic adjust weights for queues. 

Partically complete: 
- Handle convert of dynamic queue to static queue (still see some test 
failures). 

Not started: 
- Integrate with Queue Placement Policy and related tests. 

Have done unit tests for some part of logics, details see 
TestCapacitySchedulerNewQueueAutoCreation

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Priority: Major
> Attachments: YARN-10506.001.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-11 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10506:
--
Attachment: YARN-10506.001.patch

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Priority: Major
> Attachments: YARN-10506.001.patch
>
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10532:
-

 Summary: Capacity Scheduler Auto Queue Creation: Allow auto delete 
queue when queue is not being used
 Key: YARN-10532
 URL: https://issues.apache.org/jira/browse/YARN-10532
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


It's better if we can delete auto-created queues when they are not in use for a 
period of time (like 5 mins). It will be helpful when we have a large number of 
auto-created queues (e.g. from 500 users), but only a small subset of queues 
are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248149#comment-17248149
 ] 

Wangda Tan commented on YARN-10531:
---

[~zhuqi] do you want to take a try on this one? 

Thanks, 

> Be able to disable user limit factor for CapacityScheduler Leaf Queue
> -
>
> Key: YARN-10531
> URL: https://issues.apache.org/jira/browse/YARN-10531
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>
> User limit factor is used to define max cap of how much resource can be 
> consumed by single user. 
> Under Auto Queue Creation context, it doesn't make much sense to set user 
> limit factor, because initially every queue will set weight to 1.0, we want 
> user can consume more resource if possible. It is hard to pre-determine how 
> to set up user limit factor. So it makes more sense to add a new value (like 
> -1) to indicate we will disable user limit factor 
> Logic need to be changed is below: 
> (Inside LeafQueue.java)
> {code}
> Resource maxUserLimit = Resources.none();
> if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
>   maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
>   getUserLimitFactor());
> } else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) 
> {
>   maxUserLimit = partitionResource;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10531) Be able to disable user limit factor for CapacityScheduler Leaf Queue

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10531:
-

 Summary: Be able to disable user limit factor for 
CapacityScheduler Leaf Queue
 Key: YARN-10531
 URL: https://issues.apache.org/jira/browse/YARN-10531
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


User limit factor is used to define max cap of how much resource can be 
consumed by single user. 

Under Auto Queue Creation context, it doesn't make much sense to set user limit 
factor, because initially every queue will set weight to 1.0, we want user can 
consume more resource if possible. It is hard to pre-determine how to set up 
user limit factor. So it makes more sense to add a new value (like -1) to 
indicate we will disable user limit factor 

Logic need to be changed is below: 

(Inside LeafQueue.java)

{code}
Resource maxUserLimit = Resources.none();
if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY) {
  maxUserLimit = Resources.multiplyAndRoundDown(queueCapacity,
  getUserLimitFactor());
} else if (schedulingMode == SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY) {
  maxUserLimit = partitionResource;
}
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-11 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.ver-3.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch, YARN-10504.ver-3.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248114#comment-17248114
 ] 

Wangda Tan commented on YARN-10504:
---

[~zhuqi], thank you so much for your review. [~bteke] will take over the work 
from me, so [~bteke] can you continue work with [~zhuqi] to address comments? 

I just uploaded ver.3 patch, fixed a potential deadlock in 
AutoCreatedLeafQueue. 

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248087#comment-17248087
 ] 

Wangda Tan commented on YARN-10530:
---

I haven't written any UT yet, but I just want to file the ticket to make sure 
we take a closer look because the logic looks confusing. I will be delighted if 
this is a false alarm :) 

> CapacityScheduler ResourceLimits doesn't handle node partition well
> ---
>
> Key: YARN-10530
> URL: https://issues.apache.org/jira/browse/YARN-10530
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Wangda Tan
>Priority: Blocker
>
> This is a serious bug may impact all releases, I need to do further check but 
> I want to log the JIRA so we will not forget:  
> ResourceLimits objects are used to handle two purposes: 
> 1) When there's cluster resource change, for example adding new node, or 
> scheduler config reinitialize. We will pass ResourceLimits to 
> updateClusterResource to queues. 
> 2) When allocate container, we try to pass parent's available resource to 
> child to make sure child's resource allocation won't violate parent's max 
> resource. For example below: 
> {code}
> queue used  max
> --
> root  1020
> root.a8 10
> root.a.a1 2 10
> root.a.a2 6 10
> {code}
> Even though a.a1 has 8 resources headroom (a1.max - a1.used). But we can at 
> most allocate 2 resources to a1 because root.a's limit will hit first. This 
> information will be passed down from parent queue to child queue during 
> assignContainers call via ResourceLimits. 
> However, we only pass 1 ResourceLimits from top, for queue initialize, we 
> passed in: 
> {code}
> root.updateClusterResource(clusterResource, new ResourceLimits(
> clusterResource));
> {code}
> And when we update cluster resource, we only considered default partition
> {code}
>   // Update all children
>   for (CSQueue childQueue : childQueues) {
> // Get ResourceLimits of child queue before assign containers
> ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
> clusterResource, resourceLimits,
> RMNodeLabelsManager.NO_LABEL, false);
> childQueue.updateClusterResource(clusterResource, childLimits);
>   }
> {code}
> Same for allocation logic, we passed in: (Actually I found I added a TODO 
> item 5 years ago).
> {code}
> // Try to use NON_EXCLUSIVE
> assignment = getRootQueue().assignContainers(getClusterResource(),
> candidates,
> // TODO, now we only consider limits for parent for non-labeled
> // resources, should consider labeled resources as well.
> new ResourceLimits(labelManager
> .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
> getClusterResource())),
> SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code} 
> The good thing is, in the assignContainers call, we calculated child limit 
> based on partition
> {code} 
> ResourceLimits childLimits =
>   getResourceLimitsOfChild(childQueue, cluster, limits,
>   candidates.getPartition(), true);
> {code} 
> So I think now the problem is, when a named partition has more resource than 
> default partition, effective min/max resource of each queue could be wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248084#comment-17248084
 ] 

Wangda Tan commented on YARN-10530:
---

cc: [~sunilg], [~epayne]

> CapacityScheduler ResourceLimits doesn't handle node partition well
> ---
>
> Key: YARN-10530
> URL: https://issues.apache.org/jira/browse/YARN-10530
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Reporter: Wangda Tan
>Priority: Blocker
>
> This is a serious bug may impact all releases, I need to do further check but 
> I want to log the JIRA so we will not forget:  
> ResourceLimits objects are used to handle two purposes: 
> 1) When there's cluster resource change, for example adding new node, or 
> scheduler config reinitialize. We will pass ResourceLimits to 
> updateClusterResource to queues. 
> 2) When allocate container, we try to pass parent's available resource to 
> child to make sure child's resource allocation won't violate parent's max 
> resource. For example below: 
> {code}
> queue used  max
> --
> root  1020
> root.a8 10
> root.a.a1 2 10
> root.a.a2 6 10
> {code}
> Even though a.a1 has 8 resources headroom (a1.max - a1.used). But we can at 
> most allocate 2 resources to a1 because root.a's limit will hit first. This 
> information will be passed down from parent queue to child queue during 
> assignContainers call via ResourceLimits. 
> However, we only pass 1 ResourceLimits from top, for queue initialize, we 
> passed in: 
> {code}
> root.updateClusterResource(clusterResource, new ResourceLimits(
> clusterResource));
> {code}
> And when we update cluster resource, we only considered default partition
> {code}
>   // Update all children
>   for (CSQueue childQueue : childQueues) {
> // Get ResourceLimits of child queue before assign containers
> ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
> clusterResource, resourceLimits,
> RMNodeLabelsManager.NO_LABEL, false);
> childQueue.updateClusterResource(clusterResource, childLimits);
>   }
> {code}
> Same for allocation logic, we passed in: (Actually I found I added a TODO 
> item 5 years ago).
> {code}
> // Try to use NON_EXCLUSIVE
> assignment = getRootQueue().assignContainers(getClusterResource(),
> candidates,
> // TODO, now we only consider limits for parent for non-labeled
> // resources, should consider labeled resources as well.
> new ResourceLimits(labelManager
> .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
> getClusterResource())),
> SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code} 
> The good thing is, in the assignContainers call, we calculated child limit 
> based on partition
> {code} 
> ResourceLimits childLimits =
>   getResourceLimitsOfChild(childQueue, cluster, limits,
>   candidates.getPartition(), true);
> {code} 
> So I think now the problem is, when a named partition has more resource than 
> default partition, effective min/max resource of each queue could be wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10530) CapacityScheduler ResourceLimits doesn't handle node partition well

2020-12-11 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10530:
-

 Summary: CapacityScheduler ResourceLimits doesn't handle node 
partition well
 Key: YARN-10530
 URL: https://issues.apache.org/jira/browse/YARN-10530
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, capacityscheduler
Reporter: Wangda Tan


This is a serious bug may impact all releases, I need to do further check but I 
want to log the JIRA so we will not forget:  

ResourceLimits objects are used to handle two purposes: 

1) When there's cluster resource change, for example adding new node, or 
scheduler config reinitialize. We will pass ResourceLimits to 
updateClusterResource to queues. 

2) When allocate container, we try to pass parent's available resource to child 
to make sure child's resource allocation won't violate parent's max resource. 
For example below: 

{code}
queue used  max
--
root  1020
root.a8 10
root.a.a1 2 10
root.a.a2 6 10
{code}

Even though a.a1 has 8 resources headroom (a1.max - a1.used). But we can at 
most allocate 2 resources to a1 because root.a's limit will hit first. This 
information will be passed down from parent queue to child queue during 
assignContainers call via ResourceLimits. 

However, we only pass 1 ResourceLimits from top, for queue initialize, we 
passed in: 

{code}
root.updateClusterResource(clusterResource, new ResourceLimits(
clusterResource));
{code}

And when we update cluster resource, we only considered default partition

{code}
  // Update all children
  for (CSQueue childQueue : childQueues) {
// Get ResourceLimits of child queue before assign containers
ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
clusterResource, resourceLimits,
RMNodeLabelsManager.NO_LABEL, false);
childQueue.updateClusterResource(clusterResource, childLimits);
  }
{code}

Same for allocation logic, we passed in: (Actually I found I added a TODO item 
5 years ago).

{code}
// Try to use NON_EXCLUSIVE
assignment = getRootQueue().assignContainers(getClusterResource(),
candidates,
// TODO, now we only consider limits for parent for non-labeled
// resources, should consider labeled resources as well.
new ResourceLimits(labelManager
.getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
getClusterResource())),
SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code} 

The good thing is, in the assignContainers call, we calculated child limit 
based on partition
{code} 
ResourceLimits childLimits =
  getResourceLimitsOfChild(childQueue, cluster, limits,
  candidates.getPartition(), true);
{code} 

So I think now the problem is, when a named partition has more resource than 
default partition, effective min/max resource of each queue could be wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248048#comment-17248048
 ] 

Wangda Tan commented on YARN-10506:
---

I'm looking at PoC of the patch now, and will keep the JIRA updated in a day or 
two. 

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Priority: Major
>
> The queue creation logic should be updated to use weight mode and support the 
> flexible creation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247615#comment-17247615
 ] 

Wangda Tan commented on YARN-10504:
---

The patch doesn't handle auto queue creation well, so you will see some 
failures which we will address them in the later revision.

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247614#comment-17247614
 ] 

Wangda Tan commented on YARN-10504:
---

Thanks [~zhuqi], as we discussed, we're working on the patch in parallel, it 
will be great if you can let us know if you have intention to work on a patch 
so we can coordinate in the future. 

Also, as I shared offline to you, the patch you attached has the following 
issues: 
1) It changes CS config object, which makes hard for us to identify if it is a 
change caused by config file change or internal change. 
2) Auto created queue (which is why we add this feature) doesn't load config 
from CS config. 

I attached patch (see ver.2). It takes the following approach: 
1) Major updates are in updateClusterResource, which will be called after 
scheduler init and reinit.
2) Introduced new fields to QueueCapacities (weight, and normalized_weight). 
Based on normalized_weight, we will calculate absolute resource. 
3) Did a bunch of refactorings because code to handle changes of queue 
efficient resource and config is everywhere, very hard to troubleshoot and make 
further changes. I hope the refactoring in the patch can help code easier 
maintained. 

Can you please help to review it? [~zhuqi], [~bteke], [~pbacsko], [~shuzirra], 
[~snemeth], [~sunilg]

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.ver-2.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch, 
> YARN-10504.ver-2.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10504:
--
Attachment: YARN-10504.ver-1.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch, YARN-10504.ver-1.patch
>
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-12-01 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241917#comment-17241917
 ] 

Wangda Tan commented on YARN-10169:
---

Thanks [~zhuqi] for working on this. We're currently making a bunch of changes 
to the scheduler to make FairScheduler users can easier to migrate to 
CapacityScheduler. In fairScheduler, it supports mixed weights and absolute 
valued max capacity (such as X memory, Y vcores) for each queue. 

I actually confused about the behavior in CapacityScheduler after seeing this 
JIRA. For a queue structure like below: 
{code:java}
root
   \
a
   / \
  a1  a2
 /   \
a2_1  a2_2{code}

Do we allow scheduler max capacity like: 
 
a.max (absolute), a1.max (percentage), a2.max (absolute), a2_1.max (percentage).

How we calculate a2_1.max (percentage below absolute) today?

cc: [~pbacsko], [~snemeth], [~sunilg], [~bteke]

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10169.001.patch, YARN-10169.002.patch, 
> YARN-10169.003.patch
>
>
> To me this is a bug: if there's a queue has capacity set to float, and 
> maximum-capacity set to absolute value. Existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw exception when configured like this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-11-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241173#comment-17241173
 ] 

Wangda Tan commented on YARN-10504:
---

[~epayne], This is a feature to allow flexibly add and remove queues. You can 
take a look at problem statement of 
[https://docs.google.com/document/d/1r_UU1OXCjvvxudNMH-KGwK0gwyt7swUSn62eKQXQOAg/edit#heading=h.ufvkgjf6tzq2]

And if you can review it and let us know if you have any other questions that 
will be greatly helpful! 

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>
> To allow the possibility to flexibly create queues in Capacity Scheduler a 
> weight mode should be introduced. The existing \{{capacity }}property should 
> be used with a different syntax, i.e:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-25 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238837#comment-17238837
 ] 

Wangda Tan commented on YARN-10497:
---

[~shuzirra], I would still prefer to use the 
CapacitySchedulerConfiguration#getQueues, it is the same method we used in 
other places, and we should use the same util function to avoid future problems 
like this. Plz let me know what your thoughts are.

Attached ver.4 to fix the remaining checkstyle issue. 

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-25 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10497:
--
Attachment: YARN-10497.004.patch

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-24 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238410#comment-17238410
 ] 

Wangda Tan commented on YARN-10497:
---

Attached ver.3 patch, fixed checkstyle, and unit test.

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-24 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10497:
--
Attachment: YARN-10497.003.patch

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-24 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238365#comment-17238365
 ] 

Wangda Tan commented on YARN-10497:
---

Patch 002, fixed check style issue, and unit test issues. (cc: [~pbacsko])

[~shuzirra], I didn't see I used getStringCollection in the latest patch, I 
used config.getQueues instead, it doesn't use any regex to do trimming, it uses 
String.trim(). Can you help to check it again?

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-24 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10497:
--
Attachment: YARN-10497.002.patch

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236509#comment-17236509
 ] 

Wangda Tan commented on YARN-10497:
---

[~snemeth], [~pbacsko], [~shuzirra], [~bteke] can you help to review the patch? 

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-20 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10497:
--
Attachment: YARN-10497.001.patch

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-20 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10497:
-

 Summary: Fix an issue in CapacityScheduler which fails to delete 
queues
 Key: YARN-10497
 URL: https://issues.apache.org/jira/browse/YARN-10497
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Wangda Tan


We saw an exception when using queue mutation APIs:
{code:java}
2020-11-13 16:47:46,327 WARN 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
CapacityScheduler configuration validation failed:java.io.IOException: Queue 
root.am2cmQueueSecond not found
{code}
Which comes from this code:
{code:java}
List siblingQueues = getSiblingQueues(queueToRemove,
proposedConf);
if (!siblingQueues.contains(queueName)) {
  throw new IOException("Queue " + queueToRemove + " not found");
} 
{code}
(Inside MutableCSConfigurationProvider)

If you look at the method:
{code:java}
 
  private List getSiblingQueues(String queuePath, Configuration conf) {
String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
parentQueue + CapacitySchedulerConfiguration.DOT +
CapacitySchedulerConfiguration.QUEUES;
return new ArrayList<>(conf.getStringCollection(childQueuesKey));
  }
{code}
And here's capacity-scheduler.xml I got
{code:java}
yarn.scheduler.capacity.root.queuesdefault, q1, 
q2
{code}
You can notice there're spaces between default, q1, a2

So conf.getStringCollection returns:
{code:java}
default
q1
...
{code}
Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2020-11-20 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-10497:
-

Assignee: Wangda Tan

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> yarn.scheduler.capacity.root.queuesdefault, q1, 
> q2
> {code}
> You can notice there're spaces between default, q1, a2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes match issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10482) Capacity Scheduler seems locked,RM cannot submit any new job,and change active RM manually return to normal

2020-11-20 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236286#comment-17236286
 ] 

Wangda Tan commented on YARN-10482:
---

Thanks [~jiwq]! it is very helpful!

> Capacity Scheduler seems locked,RM cannot submit any new job,and change 
> active RM  manually return to normal
> 
>
> Key: YARN-10482
> URL: https://issues.apache.org/jira/browse/YARN-10482
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager, 
> RM
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
> Attachments: RM_normal_state.stack, RM_unnormal_state.stack
>
>
> Capacity Scheduler seems locked,RM cannot submit any new job, and change 
> active RM manually return to normal。its a serious bug!I check the stack 
> log,and found some info about *ReentrantReadWriteLock。*Can  anyone can solve 
> this issue?I uploaded the stack when RM normally and unnormally。RM  hangs 
> forever until I restart RM or change the active RM manually!!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10492) deadlock in rm

2020-11-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235833#comment-17235833
 ] 

Wangda Tan commented on YARN-10492:
---

That will be helpful, thanks Jufeng!

> deadlock in rm 
> ---
>
> Key: YARN-10492
> URL: https://issues.apache.org/jira/browse/YARN-10492
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: brick yang
>Priority: Critical
>  Labels: 3.1.1
>
> version: HDP-3.1.5.0-152   hadoop3.1
> capacity scheduler
> yarn sometimes not change to active
> we found that jstack dump has deadlocked:
> "IPC Server handler 44 on 8030" #316 daemon prio=5 os_prio=0 
> tid=0x7fee8216e800 nid=0x63edc waiting for monitor entry 
> [0x7fee09633000]
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:323)
>  - waiting to lock <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
>  
>  
>  
>  
>  
>  
> "IPC Server handler 8 on 8030" #280 daemon prio=5 os_prio=0 
> tid=0x7fee83823800 nid=0x63eb8 waiting on condition [0x7fee0ba57000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0003c0d0d6c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1997)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:676)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.releaseContainers(AbstractYarnScheduler.java:753)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:1182)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:279)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.SchedulerPlacementProcessor.allocate(SchedulerPlacementProcessor.java:53)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>  - locked <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at 

[jira] [Commented] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-11-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235683#comment-17235683
 ] 

Wangda Tan commented on YARN-10496:
---

Worked with [~bteke] for a design doc, see the linked doc. Would like to see 
more comments from the community:

cc: [~epayne], [~jhung], [~tangzhankun], [~bilwa_st]

> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> -
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> CapacityScheduler today doesn’t support an auto queue creation which is 
> flexible enough. The current constraints: 
>  * Only leaf queues can be auto-created
>  * A parent can only have either static queues or dynamic ones. This causes 
> multiple constraints. For example:
>  * It isn’t possible to have a VIP user like Alice with a static queue 
> root.user.alice with 50% capacity while the other user queues (under 
> root.user) are created dynamically and they share the remaining 50% of 
> resources.
>  
>  * In comparison, FairScheduler allows the following scenarios, Capacity 
> Scheduler doesn’t:
>  ** This implies that there is no possibility to have both dynamically 
> created and static queues at the same time under root
>  * A new queue needs to be created under an existing parent, while the parent 
> already has static queues
>  * Nested queue mapping policy, like in the following example: 
> |
> 
> |
>  * Here two levels of queues may need to be created 
> If an application belongs to user _alice_ (who has the primary_group of 
> _engineering_), the scheduler checks whether _root.engineering_ exists, if it 
> doesn’t,  it’ll be created. Then scheduler checks whether 
> _root.engineering.alice_ exists, and creates it if it doesn't.
>  
> When we try to move users from FairScheduler to CapacityScheduler, we face 
> feature gaps which blocks users migrate from FS to CS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-11-19 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10496:
-

 Summary: [Umbrella] Support Flexible Auto Queue Creation in 
Capacity Scheduler
 Key: YARN-10496
 URL: https://issues.apache.org/jira/browse/YARN-10496
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: capacity scheduler
Reporter: Wangda Tan


CapacityScheduler today doesn’t support an auto queue creation which is 
flexible enough. The current constraints: 
 * Only leaf queues can be auto-created
 * A parent can only have either static queues or dynamic ones. This causes 
multiple constraints. For example:

 * It isn’t possible to have a VIP user like Alice with a static queue 
root.user.alice with 50% capacity while the other user queues (under root.user) 
are created dynamically and they share the remaining 50% of resources.

 
 * In comparison, FairScheduler allows the following scenarios, Capacity 
Scheduler doesn’t:
 ** This implies that there is no possibility to have both dynamically created 
and static queues at the same time under root
 * A new queue needs to be created under an existing parent, while the parent 
already has static queues
 * Nested queue mapping policy, like in the following example: 

|

|
 * Here two levels of queues may need to be created 

If an application belongs to user _alice_ (who has the primary_group of 
_engineering_), the scheduler checks whether _root.engineering_ exists, if it 
doesn’t,  it’ll be created. Then scheduler checks whether 
_root.engineering.alice_ exists, and creates it if it doesn't.

 

When we try to move users from FairScheduler to CapacityScheduler, we face 
feature gaps which blocks users migrate from FS to CS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10492) deadlock in rm

2020-11-17 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233880#comment-17233880
 ] 

Wangda Tan commented on YARN-10492:
---

cc: [~snemeth]  , [~pbacsko], [~tangzhankun], [~sunil.gov...@gmail.com]. 

> deadlock in rm 
> ---
>
> Key: YARN-10492
> URL: https://issues.apache.org/jira/browse/YARN-10492
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: brick yang
>Priority: Critical
>  Labels: 3.1.1
>
> version: HDP-3.1.5.0-152   hadoop3.1
> capacity scheduler
> yarn sometimes not change to active
> we found that jstack dump has deadlocked:
> "IPC Server handler 44 on 8030" #316 daemon prio=5 os_prio=0 
> tid=0x7fee8216e800 nid=0x63edc waiting for monitor entry 
> [0x7fee09633000]
>  java.lang.Thread.State: BLOCKED (on object monitor)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:323)
>  - waiting to lock <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
>  
>  
>  
>  
>  
>  
> "IPC Server handler 8 on 8030" #280 daemon prio=5 os_prio=0 
> tid=0x7fee83823800 nid=0x63eb8 waiting on condition [0x7fee0ba57000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0003c0d0d6c0> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1664)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1997)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:676)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.releaseContainers(AbstractYarnScheduler.java:753)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:1182)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:279)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.SchedulerPlacementProcessor.allocate(SchedulerPlacementProcessor.java:53)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>  - locked <0x00043e2e19d0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at 

[jira] [Updated] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-30 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-10458:
--
Fix Version/s: 3.4.0

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch, 
> YARN-10458-003.patch, YARN-10458-004.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
> QueueACL acl, String queueName) {
> CSQueue queue = getQueue(queueName);
> if (queue == null) {
> if (LOG.isDebugEnabled())
> { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
> queueName); }
> return false;*<-- the method returns false here.*
> }
> return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto created queue, CSQueue may be null in this case. May be 
> scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
> null and if queue mapping is involved and if so, check if the parent queue 
> exists and is a managed parent and if so, check if the parent queue has valid 
> ACL's instead of returning false ?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223771#comment-17223771
 ] 

Wangda Tan commented on YARN-10458:
---

I just committed the patch to trunk, thanks [~anand.srinivasan] for reporting 
this issue and thanks [~pbacsko] for working on the patch.

[~pbacsko] can you help to backport to corresponding branches? 

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch, 
> YARN-10458-003.patch, YARN-10458-004.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
> QueueACL acl, String queueName) {
> CSQueue queue = getQueue(queueName);
> if (queue == null) {
> if (LOG.isDebugEnabled())
> { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
> queueName); }
> return false;*<-- the method returns false here.*
> }
> return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto created queue, CSQueue may be null in this case. May be 
> scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
> null and if queue mapping is involved and if so, check if the parent queue 
> exists and is a managed parent and if so, check if the parent queue has valid 
> ACL's instead of returning false ?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-30 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223735#comment-17223735
 ] 

Wangda Tan commented on YARN-10458:
---

+1, thanks [~pbacsko], will get it committed later today. 

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch, 
> YARN-10458-003.patch, YARN-10458-004.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
> QueueACL acl, String queueName) {
> CSQueue queue = getQueue(queueName);
> if (queue == null) {
> if (LOG.isDebugEnabled())
> { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
> queueName); }
> return false;*<-- the method returns false here.*
> }
> return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto created queue, CSQueue may be null in this case. May be 
> scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
> null and if queue mapping is involved and if so, check if the parent queue 
> exists and is a managed parent and if so, check if the parent queue has valid 
> ACL's instead of returning false ?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-29 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223117#comment-17223117
 ] 

Wangda Tan commented on YARN-10458:
---

[~pbacsko], there're two issues in the test, one is setup NodelabelManager 
after RM created, it somehow didn't get the right label manager (I didn't do 
further troubleshooting), the correct way to do it is: 
{code:java}
MockRM rm = new MockRM(csConf) {
  @Override
  public RMNodeLabelsManager createNodeLabelManager() {
return mgr;
  }
}; {code}
The label manager is used by scheduler to correctly calculate effective 
resources: 
*org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedLeafQueue#mergeCapacities*
{code:java}
Resource resourceByLabel = labelManager.getResourceByLabel(nodeLabel,
csContext.getClusterResource()); {code}
So it causes the app cannot move to RUNNING because effective is always 0.

Second issue is, you called a nodeHearbeat before launchAndRegisterAM, it makes 
app attempt advanced to ALLOCATED state instead of SCHEDULED state. after 
removed the hearbeat call, it works fine now. 

Please add more checks for the queue creation, and I suggest to move this test 
to TestCapacitySchedulerAutoQUeueCreation.

A good resource to reference to is tests inside 
`TestNodeLabelContainerAllocation` if you write a test, reference to 
TestNodeLabelContainerAllocation will be a good starting point.

Here's full test code after changes I made: 
{code:java}
 @Test
public void testAccessCheckOfNonExistingDynamicQueueWithTags()
throws Exception {
  CapacitySchedulerConfiguration csConf
  = new CapacitySchedulerConfiguration();
  csConf.setQueues(CapacitySchedulerConfiguration.ROOT,
  new String[] {"a", "b"});
  csConf.setCapacity("root.a", 90);
  csConf.setCapacity("root.b", 10);
  csConf.set("yarn.scheduler.capacity.resource-calculator",
  "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator");
  csConf.setAutoCreateChildQueueEnabled("root.a", true);
  csConf.setAutoCreatedLeafQueueConfigCapacity("root.a", 50);
  csConf.setAutoCreatedLeafQueueConfigMaxCapacity("root.a", 100);
  
csConf.set(CapacitySchedulerConfiguration.MAXIMUM_APPLICATION_MASTERS_RESOURCE_PERCENT,
  "0.5");
  csConf.setAcl("root.a", QueueACL.ADMINISTER_QUEUE, "*");
  csConf.setAcl("root.a", QueueACL.SUBMIT_APPLICATIONS, "*");
  csConf.setBoolean(YarnConfiguration
  .APPLICATION_TAG_BASED_PLACEMENT_ENABLED, true);
  csConf.setStrings(YarnConfiguration
  .APPLICATION_TAG_BASED_PLACEMENT_USER_WHITELIST, "hadoop");
  csConf.set(CapacitySchedulerConfiguration.QUEUE_MAPPING, 
"u:%user:root.a.%user");
  csConf.setInt("yarn.scheduler.minimum-allocation-mb", 1024);
  csConf.setInt("yarn.scheduler.minimum-allocation-vcores", 1);

  YarnConfiguration conf=new YarnConfiguration(csConf);
  conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class,
  ResourceScheduler.class);
  RMNodeLabelsManager mgr = new NullRMNodeLabelsManager();
  mgr.init(conf);
  MockRM rm = new MockRM(csConf) {
@Override
public RMNodeLabelsManager createNodeLabelManager() {
  return mgr;
}
  };
  rm.start();
  MockNM nm = rm.registerNode("127.0.0.1:1234", 16 * GB);

  MockRMAppSubmissionData data =
  MockRMAppSubmissionData.Builder.createWithMemory(GB, rm)
  .withAppName("apptodynamicqueue")
  .withUser("hadoop")
  .withAcls(null)
  .withUnmanagedAM(false)
  .withApplicationTags(Sets.newHashSet("userid=testuser"))
  .build();
  RMApp app = MockRMAppSubmitter.submit(rm, data);
  MockRM.launchAndRegisterAM(app, rm, nm); // stuck in SCHEDULED state

  nm.nodeHeartbeat(true);
}{code}

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping 

[jira] [Commented] (YARN-10425) Replace the legacy placement engine in CS with the new one

2020-10-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221619#comment-17221619
 ] 

Wangda Tan commented on YARN-10425:
---

Thanks [~shuzirra], this is a really big and important effort :). 

I don't think I will have a chance to review the code, just one ask: let's make 
sure the behavior is backward compatible, and let's keep the original test 
cases in place (I remember we have a bunch of them).

> Replace the legacy placement engine in CS with the new one
> --
>
> Key: YARN-10425
> URL: https://issues.apache.org/jira/browse/YARN-10425
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10425.001.patch, YARN-10425.002.patch
>
>
> Remove the UserGroupMapping and ApplicationName mapping classes, and use the 
> new CSMappingPlacementRule instead. Also cleanup the orphan classes which are 
> used by these classes only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-27 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221616#comment-17221616
 ] 

Wangda Tan commented on YARN-10458:
---

It looks good to me, the only question/ask is since this issue occurred when we 
have queue placement based on app tag userid + queue auto creation, can we make 
sure either a test added for that (preferred), we need definitely set up a 
single node cluster and make sure it works in the scenario.

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch, YARN-10458-002.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
> QueueACL acl, String queueName) {
> CSQueue queue = getQueue(queueName);
> if (queue == null) {
> if (LOG.isDebugEnabled())
> { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
> queueName); }
> return false;*<-- the method returns false here.*
> }
> return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto created queue, CSQueue may be null in this case. May be 
> scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
> null and if queue mapping is involved and if so, check if the parent queue 
> exists and is a managed parent and if so, check if the parent queue has valid 
> ACL's instead of returning false ?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-26 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220808#comment-17220808
 ] 

Wangda Tan commented on YARN-10458:
---

[~pbacsko], 

Thanks for working on the patch, 

I'm trying to understand if

{{929       String queue = appPlacementContext.getQueue();}}

always returns relative queue path, if no, the below logic could be wrong: 
{code:java}
931  if (parent != null) {
932queue = parent + "." + queue;
933  } {code}
And do we always have full qualified queue path in the queue mapping setting? 
{code:java}
2295  // can only check proper ACLs if the path is fully qualified
2296  while (queue == null || !queueName.equals("root")) { {code}
I don't fully remember this part of logic.

> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10458-001.patch
>
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 2020-09-18 01:08:40,579 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
>  Application application_1600399068816_0460 user user1 mapping [default] to 
> [queue1] override false
> 2020-09-18 01:08:40,579 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
> application tag does not have access to  queue 'user1'. The placement is done 
> for user 'hive'
>  
> Checking the code, scheduler#checkAccess() bails out even before checking the 
> ACL permissions for that particular queue because the CSQueue is null.
> {code:java}
> public boolean checkAccess(UserGroupInformation callerUGI,
> QueueACL acl, String queueName) {
> CSQueue queue = getQueue(queueName);
> if (queue == null) {
> if (LOG.isDebugEnabled())
> { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
> queueName); }
> return false;*<-- the method returns false here.*
> }
> return queue.hasAccess(acl, callerUGI);
> }
> {code}
> As this is an auto created queue, CSQueue may be null in this case. May be 
> scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
> null and if queue mapping is involved and if so, check if the parent queue 
> exists and is a managed parent and if so, check if the parent queue has valid 
> ACL's instead of returning false ?
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217285#comment-17217285
 ] 

Wangda Tan edited comment on YARN-10178 at 10/20/20, 5:33 AM:
--

Since recently we have a customer has the same issue, I spent some time to look 
at this, thanks [~tuyu] for the detailed analysis. 

Apart from issues mentioned by [~tuyu], which is async-scheduling related, I 
think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer, it doesn't hold scheduler's lock, so it 
can happen even though async-scheduling is disabled. (I checked the problematic 
log, there's a container release event happened at the exact same timestamp 
(within the same milli-second) when the RM crash happens. 

One possible way to fix the problem is, inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of queue capacities 
(which includes 
{code:java}
 AbsoluteUsedCapacity
 UsedCapacity
 ConfiguredMinResource
 AbsoluteCapacity

And plus CSQueue's reference
){code}
Create a new internal class (like PriorityQueueResourcesForSorting) to include 
the 4 fields, instead of sorting the CSQueue directly, we will sort the new 
structure. 

There're additional costs to copy the resources field, but it should be minimum 
for most cases (unless you have thousands of queues). cc: [~bteke] 


was (Author: wangda):
Since recently we have a customer has the same issue, I spent some time to look 
at this, thanks [~tuyu] for the detailed analysis. 

Apart from issues mentioned by [~tuyu], which is async-scheduling related, I 
think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer, it doesn't hold scheduler's lock, so it 
can happen even though async-scheduling is disabled. (I checked the problematic 
log, there's a container release event happened at the exact same timestamp 
(within the same milli-second) when the RM crash happens. 

One possible way to fix the problem is, inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of queue capacities 
(which includes 
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
)

Create a new internal class (like PriorityQueueResourcesForSorting) to include 
the 4 fields, instead of sorting the CSQueue directly, we will sort the new 
structure. 

There're additional costs to copy the resources field, but it should be minimum 
for most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>

[jira] [Commented] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217285#comment-17217285
 ] 

Wangda Tan commented on YARN-10178:
---

Since recently we have a customer has the same issue, I spent some time to look 
at this, thanks [~tuyu] for the detailed analysis. 

Apart from issues mentioned by [~tuyu], which is async-scheduling related, I 
think it can also happen when async-scheduling is disabled. 

A possible place is completedContainer, it doesn't hold scheduler's lock, so it 
can happen even though async-scheduling is disabled. (I checked the problematic 
log, there's a container release event happened at the exact same timestamp 
(within the same milli-second) when the RM crash happens. 

One possible way to fix the problem is, inside 
PriorityUtilizationQueueOrderingPolicy, take a snapshot of queue capacities 
(which includes 
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
)

Create a new internal class (like PriorityQueueResourcesForSorting) to include 
the 4 fields, instead of sorting the CSQueue directly, we will sort the new 
structure. 

There're additional costs to copy the resources field, but it should be minimum 
for most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> JAVA 8 Arrays.sort default use timsort algo, and timsort has  few require 
> {code:java}
> 1.x.compareTo(y) != y.compareTo(x)
> 2.x>y,y>z --> x > z
> 3.x=y, x.compareTo(z) == y.compareTo(z)
> {code}
> if not Arrays paramters not satify this require,TimSort will throw 
> 'java.lang.IllegalArgumentException'
> look at PriorityUtilizationQueueOrderingPolicy.compare function,we will know 
> Capacity Scheduler use this these queue resource usage to compare
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In Capacity Scheduler Global Scheduler AsyncThread use 
> PriorityUtilizationQueueOrderingPolicy function to choose queue to assign 
> container,and construct a CSAssignment struct, and use 
> submitResourceCommitRequest function 

  1   2   3   4   5   6   7   8   9   10   >