[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155474#comment-16155474 ]
Jason Lowe commented on YARN-7149:
----------------------------------
Thanks for the report and analysis, Eric! So it appears YARN-5889's change to
try to balance the growth of users invalidated the preemption monitor's
forecasting of resource assignments. One way to fix this is to change the
preemption monitor's forecasting calculations to use the old user limit
calculations; however, I'm wondering if we should instead revisit the decision
to change the user limit calculations in YARN-5889.
I understand the desire to balance user growth, but it seems like this is
going to significantly slow container assignment when there are multiple
active users, all to solve a problem that I'm not sure is a real problem in
practice. If I understand the concern properly, we want to avoid a situation
where one user quickly rushes ahead to their full user limit, well ahead of
the other users, and then something happens before the other users reach that
same limit (e.g.: more users become active, the cluster loses capacity, etc.).
That window should be very small in practice (i.e.: a few seconds to a few
tens of seconds) because the user limit should reflect capacity that is
available right now. The speed at which the user limit is reached should only
be limited by the heartbeat rate of the nodes and how picky the container
requests are.
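To make that concrete, here is a rough sketch of the pre-YARN-5889 user limit
calculation (simplified; the real LeafQueue#computeUserLimit also folds in the
resources currently being requested, so treat the class, method names, and
exact formula here as illustrative, not the actual scheduler code):
{code:java}
// Simplified sketch of the pre-YARN-5889 user limit; illustrative only.
public class OldUserLimitSketch {
  static long computeOldUserLimit(long queueCapacityMB, int activeUsers,
                                  int minUserLimitPercent, float userLimitFactor) {
    // Divide the queue's current capacity evenly among the active users...
    long evenShare = queueCapacityMB / activeUsers;
    // ...but never below the minimum-user-limit-percent floor...
    long mulpFloor = queueCapacityMB * minUserLimitPercent / 100;
    long limit = Math.max(evenShare, mulpFloor);
    // ...and never above user-limit-factor times the queue's capacity.
    return Math.min(limit, (long) (queueCapacityMB * userLimitFactor));
  }

  public static void main(String[] args) {
    // Two active users on a 10 GB queue with MULP=10%, ULF=2.0 => 5,000 MB each.
    System.out.println(computeOldUserLimit(10_000, 2, 10, 2.0f));
  }
}
{code}
With two active users this immediately yields the full 50% share, which is why
each user can reach the limit as fast as nodes can heartbeat in.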
I'm concerned about the new approach because it looks like it will
significantly slow down container assignments. For example, suppose there are
two users, A and B, each with a single active application asking for many more
containers than the queue can provide. User A's app is ahead of user B's app
in the queue, and the queue is initially almost empty. Before the user limit
change, the user limit for each user would be 50% since they are the only two
active users in the queue. As nodes heartbeat into the scheduler, the
scheduler would aggressively assign containers, likely more than one per
heartbeat, to user A until the 50% user limit is reached. At that point it
would switch to assigning containers to user B, again likely more than one per
node heartbeat. Unless the container requests are very picky, it should only
take two rounds or so of node heartbeats to satisfy both users, which should
only be a small number of seconds. With the new limit calculation, the user
limits for A and B are going to be only a minimal increment over what they're
using. Therefore each node heartbeat will only assign one container to each
user rather than multiple, since the scheduler will keep running into the user
limit before it grows. The end result is that it will take many more node
heartbeats to get everything assigned, and users will perceive that as a slow
scheduler. Do we really need to keep the assignments balanced as users grow to
their limit? It looks like it will be a significant performance hit to do so,
since we will keep hitting the limit on each node heartbeat, cutting short the
number of containers we would normally assign per heartbeat.
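To put rough numbers on that, here is a toy simulation contrasting the two
limit styles. All values are hypothetical (a 100-container queue, up to 5
assignable containers per node heartbeat), and this is not the scheduler's
actual assignment loop:
{code:java}
// Toy model: heartbeats needed for two users to fill a 100-container queue.
public class UserLimitGrowthSketch {
  public static void main(String[] args) {
    System.out.println("fixed 50% limit:   " + heartbeats(true));   // 20
    System.out.println("incremental limit: " + heartbeats(false));  // 50
  }

  static int heartbeats(boolean fixedLimit) {
    final int queue = 100, perHeartbeat = 5;
    int[] used = {0, 0};   // containers held by users A and B
    int beats = 0;
    while (used[0] + used[1] < queue) {
      beats++;
      int budget = perHeartbeat;
      for (int u = 0; u < 2 && budget > 0; u++) {
        // Fixed: the full 50% share up front. Incremental (YARN-5889-style,
        // as I understand it): only a minimal step above current usage.
        int limit = fixedLimit ? queue / 2 : Math.min(used[u] + 1, queue / 2);
        while (budget > 0 && used[u] < limit) {
          used[u]++;
          budget--;
        }
      }
    }
    return beats;
  }
}
{code}
The fixed limit fills the queue in 20 heartbeats, while the incremental limit
takes 50, because each pass tops out at one container per user.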
> Cross-queue preemption sometimes starves an underserved queue
> -------------------------------------------------------------
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.9.0, 3.0.0-alpha3
> Reporter: Eric Payne
> Assignee: Eric Payne
>
> In branch-2 and trunk, I am consistently seeing some use cases where
> cross-queue preemption does not happen when it should. I do not see this in
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far underserved_
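For what it's worth, the numbers in the use case above line up with the
forecasting problem described in the comment. A sketch, assuming the monitor
caps User2's forecast need at roughly the greater of current usage plus one
minimal increment and the MULP floor (the formula and all names here are my
guesses, not the actual preemption monitor code):
{code:java}
// Hypothetical walk-through of the forecast numbers in the use case above.
public class Yarn7149ForecastSketch {
  public static void main(String[] args) {
    long q2Guaranteed = 10_000;                // MB: Q2's 50% of the 20 GB cluster
    long q2Used = 0;                           // MB: Q2 usage before preemption
    long app2Pending = 10_000;                 // MB: what App2 actually asked for
    long minAlloc = 500;                       // MB: minimum container size
    long mulpFloor = q2Guaranteed * 10 / 100;  // 10% MULP => 1,000 MB

    // Incremental limit: roughly max(current usage + one minimal step, MULP floor).
    long user2Limit = Math.max(q2Used + minAlloc, mulpFloor);  // 1,000 MB

    // The monitor's forecast caps Q2's need at User2's limit, so it preempts
    // only 1,000 MB (2 containers) and then treats Q2 as satisfied.
    long forecastNeed = Math.min(app2Pending, user2Limit - q2Used);
    System.out.println("forecast need = " + forecastNeed + " MB");  // 1,000
    System.out.println("actual need   = " + app2Pending + " MB");   // 10,000
  }
}
{code}
Under that assumption the monitor stops after the first 1 GB, matching the
observed behavior.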