[ https://issues.apache.org/jira/browse/YARN-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715215#comment-15715215 ]
Jason Lowe commented on YARN-5889:
----------------------------------
bq. This means that we will be doing same as what we do earlier too with some
minor improvements in a busy cluster
It shouldn't take a busy cluster to see the improvement. If a user is running
many applications that are all asking for resources but the user has hit the
user limit, today the scheduler redundantly recomputes the user limit for each
application on each heartbeat. The lazy-compute-when-dirty approach will not
compute it at all unless a container has been allocated or released for that
user in that queue. I would argue that's much more than a minor improvement,
and users hitting their limits is a common case on our clusters even when
they're not completely full.
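
To illustrate the lazy-compute-when-dirty idea, here is a minimal sketch. It is
not the CapacityScheduler code or any of the attached patches: the class
LazyUserLimitCache, its methods, and the constant limit of 100 are all
hypothetical, and it assumes a user's limit only needs recomputing after a
container is allocated or released for that user in that queue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical sketch: caches a per-user limit and recomputes it only when dirty. */
class LazyUserLimitCache {
    // Stand-in for the real (expensive) user-limit calculation; an assumption.
    private final ToIntFunction<String> expensiveCompute;
    // Users with an entry here have a clean (valid) cached limit.
    private final Map<String, Integer> cached = new HashMap<>();
    private int computeCount = 0;

    LazyUserLimitCache(ToIntFunction<String> expensiveCompute) {
        this.expensiveCompute = expensiveCompute;
    }

    /** Call when a container is allocated or released for this user. */
    synchronized void markDirty(String user) {
        cached.remove(user);
    }

    /** Call on every heartbeat; recomputes only if the cached value is missing. */
    synchronized int getUserLimit(String user) {
        Integer limit = cached.get(user);
        if (limit == null) {
            limit = expensiveCompute.applyAsInt(user);
            computeCount++;
            cached.put(user, limit);
        }
        return limit;
    }

    /** Number of times the expensive calculation actually ran. */
    synchronized int getComputeCount() {
        return computeCount;
    }

    public static void main(String[] args) {
        LazyUserLimitCache cache = new LazyUserLimitCache(user -> 100);
        // Many heartbeats from a user stuck at the limit: one computation.
        for (int i = 0; i < 1000; i++) {
            cache.getUserLimit("alice");
        }
        System.out.println(cache.getComputeCount()); // prints 1
        // A container release dirties the value; the next heartbeat recomputes.
        cache.markDirty("alice");
        cache.getUserLimit("alice");
        System.out.println(cache.getComputeCount()); // prints 2
    }
}
```

Because the recomputation happens synchronously inside getUserLimit, a caller
never observes a stale limit, which is the property the asynchronous approach
gives up.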
The asynchronous approach is very concerning to me. We are essentially trading
correctness for performance, and that seems to me like a reckless pursuit when
there are still ways to improve performance without adding new race conditions
and constraint violations. Obviously moving the calculation outside of the
allocate thread will show significant improvements in benchmarks, but those
results don't show the cost of the scheduler violating its constraints. IMHO
that's a misleading result.
I also question the logic of relying on preemption and opportunistic containers
to "solve" the constraint violation problems. Neither of those features is
free. Preemption loses work, and opportunistic containers aren't guaranteed to
be allocated in a timely manner (or could in turn be preempted). In theory
this should eventually converge to a more correct constraint value, but I would
argue at a cost of allocation latency and lost work.
This feature is blocking user-limit-based in-queue preemption, which we are
very eager to see. I propose we go with a simple approach that is easy to
implement and simple to prove correct. Adding something that can violate
the scheduler's constraints doesn't seem necessary to unblock the in-queue
preemption work. Let's get that work unblocked and we can continue to discuss
asynchronous constraint violation approaches in parallel.
> Improve user-limit calculation in capacity scheduler
> ----------------------------------------------------
>
> Key: YARN-5889
> URL: https://issues.apache.org/jira/browse/YARN-5889
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Sunil G
> Assignee: Sunil G
> Attachments: YARN-5889.v0.patch, YARN-5889.v1.patch,
> YARN-5889.v2.patch
>
>
> Currently user-limit is computed during every heartbeat allocation cycle with
> a write lock. To improve performance, this ticket focuses on moving
> user-limit calculation out of heartbeat allocation flow.