[
https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544602#comment-14544602
]
Wangda Tan commented on YARN-3388:
----------------------------------
Thanks for updating, [~nroberts]. I took a look at the latest patch; some comments:
1) It may be better to rename rbl to partitionResource in a couple of places;
rbl is not a very clear name to me.
2) A bigger problem: updateClusterResource only considers NO_LABEL, but
computeUserLimit uses getUsageRatio for all partitions. The ratio will be
inaccurate whenever the resource of a partition is updated.
Solution could be:
a. Only use getUsageRatio when partition=NO_LABEL.
b. Recompute all partitions in updateClusterResource.
I prefer b, since the other code paths in your patch all take partitions into
account. You can take a look at CSQueueUtils#updateQueueStatistics; it should
have very similar logic for handling partitions when the cluster resource
updates.
3) It's better not to put the user-usage-ratio in ResourceUsage; ResourceUsage
is intended to track common resources for user/app/queue. I suggest creating a
ResourceUsage-like structure in LeafQueue, which User and LeafQueue will share.
4) Better to split User.updateUsageRatio into two methods,
User.updateAndGetDeltaOfDominateResourceRatio and
User.updateAndGetDominateResourceRatio; the "reset" parameter is not very
straightforward to me.
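
To make points 2b and 4 concrete, here is a rough, standalone sketch. It is not
code from the patch or from YARN: UsageRatioSketch, LeafQueueSketch,
recomputeAllPartitions, the long[] {memoryMB, vcores} representation and the
"gpu" partition are all made up for illustration. Only the two User method
names follow the ones proposed in point 4. The idea is that each user keeps a
dominant-resource ratio per partition, and a cluster resource update recomputes
every partition rather than just NO_LABEL.

```java
import java.util.HashMap;
import java.util.Map;

public class UsageRatioSketch {
    static final String NO_LABEL = "";

    /** Dominant share of {memoryMB, vcores} usage against a partition's total. */
    static float dominantRatio(long[] used, long[] total) {
        if (total[0] <= 0 || total[1] <= 0) {
            return 0f; // empty partition: treat the ratio as zero
        }
        return Math.max((float) used[0] / total[0], (float) used[1] / total[1]);
    }

    /** Hypothetical stand-in for LeafQueue.User. */
    static class User {
        private final Map<String, long[]> used = new HashMap<>();
        private final Map<String, Float> ratio = new HashMap<>();

        void setUsed(String partition, long memoryMB, long vcores) {
            used.put(partition, new long[] {memoryMB, vcores});
        }

        /** Recomputes and stores the ratio for one partition, returning the new value. */
        float updateAndGetDominateResourceRatio(String partition, long[] partitionResource) {
            long[] u = used.getOrDefault(partition, new long[] {0, 0});
            float r = dominantRatio(u, partitionResource);
            ratio.put(partition, r);
            return r;
        }

        /** Same update, but returns the change relative to the stored ratio. */
        float updateAndGetDeltaOfDominateResourceRatio(String partition, long[] partitionResource) {
            float old = ratio.getOrDefault(partition, 0f);
            return updateAndGetDominateResourceRatio(partition, partitionResource) - old;
        }
    }

    /** Hypothetical stand-in for the queue-side bookkeeping. */
    static class LeafQueueSketch {
        final Map<String, User> users = new HashMap<>();
        final Map<String, Float> usageRatio = new HashMap<>();

        /** Option b: on a cluster resource update, recompute every partition. */
        void recomputeAllPartitions(Map<String, long[]> partitionResources) {
            usageRatio.clear();
            for (Map.Entry<String, long[]> e : partitionResources.entrySet()) {
                float total = 0f;
                for (User user : users.values()) {
                    total += user.updateAndGetDominateResourceRatio(e.getKey(), e.getValue());
                }
                usageRatio.put(e.getKey(), total);
            }
        }
    }

    public static void main(String[] args) {
        LeafQueueSketch queue = new LeafQueueSketch();
        User alice = new User();
        alice.setUsed(NO_LABEL, 2048, 8);
        alice.setUsed("gpu", 1024, 1);
        queue.users.put("alice", alice);

        Map<String, long[]> resources = new HashMap<>();
        resources.put(NO_LABEL, new long[] {8192, 16});
        resources.put("gpu", new long[] {4096, 8});
        queue.recomputeAllPartitions(resources);
        System.out.println(queue.usageRatio.get(NO_LABEL)); // 0.5
        System.out.println(queue.usageRatio.get("gpu"));    // 0.25

        // The "gpu" partition doubles in size; its ratio must be refreshed too.
        resources.put("gpu", new long[] {8192, 16});
        queue.recomputeAllPartitions(resources);
        System.out.println(queue.usageRatio.get("gpu"));    // 0.125
    }
}
```

With two separate methods, the caller picks "new value" or "delta" by name
instead of passing a boolean flag, which is the readability gain point 4 is
after.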
> Allocation in LeafQueue could get stuck because DRF calculator isn't well
> supported when computing user-limit
> -------------------------------------------------------------------------------------------------------------
>
> Key: YARN-3388
> URL: https://issues.apache.org/jira/browse/YARN-3388
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.6.0
> Reporter: Nathan Roberts
> Assignee: Nathan Roberts
> Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch,
> YARN-3388-v2.patch
>
>
> When there are multiple active users in a queue, it should be possible for
> those users to make use of capacity up-to max_capacity (or close). The
> resources should be fairly distributed among the active users in the queue.
> This works pretty well when there is a single resource being scheduled.
> However, when there are multiple resources the situation gets more complex
> and the current algorithm tends to get stuck at Capacity.
> Example illustrated in subsequent comment.