[ 
https://issues.apache.org/jira/browse/YARN-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715215#comment-15715215
 ] 

Jason Lowe commented on YARN-5889:
----------------------------------

bq. This means that we will be doing same as what we do earlier too with some 
minor improvements in a busy cluster

It shouldn't take a busy cluster to see the improvement.  If a user is running 
many applications that are all asking for resources but the user has hit the 
user limit, today the scheduler redundantly recomputes the user limit for each 
application on each heartbeat.  The lazy-compute-when-dirty approach will not 
compute it at all unless a container has been allocated or released for that 
user in that queue.  I would argue that's much more than a minor improvement, 
and users hitting their limits is a common case on our clusters even when 
they're not completely full.
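
To make the lazy-compute-when-dirty idea concrete, here is a minimal sketch.  It is not the CapacityScheduler code: the class name, the method names, and the placeholder limit formula (queue capacity divided by the number of users holding resources) are all assumptions made for illustration, and it coarsens the dirty tracking to one flag per queue rather than one per user.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a lazily recomputed user limit; not YARN code. */
class LazyUserLimitCache {
    private final int queueCapacity;
    private final Map<String, Integer> usedByUser = new HashMap<>();
    private int cachedLimit;
    private boolean dirty = true;   // nothing has been computed yet
    private int recomputeCount = 0; // only here to show how often we recompute

    LazyUserLimitCache(int queueCapacity) {
        this.queueCapacity = queueCapacity;
    }

    /** An allocation changes usage, so the cached limit becomes stale. */
    synchronized void containerAllocated(String user, int resource) {
        usedByUser.merge(user, resource, Integer::sum);
        dirty = true;
    }

    /** A release changes usage too, so it also marks the cache dirty. */
    synchronized void containerReleased(String user, int resource) {
        usedByUser.merge(user, -resource, Integer::sum);
        dirty = true;
    }

    /** Called on every heartbeat; recomputes only if something changed. */
    synchronized int getUserLimit() {
        if (dirty) {
            cachedLimit = computeLimit();
            dirty = false;
        }
        return cachedLimit;
    }

    /** Placeholder formula (an assumption): capacity split across users holding resources. */
    private int computeLimit() {
        recomputeCount++;
        long activeUsers = usedByUser.values().stream().filter(v -> v > 0).count();
        return (int) (queueCapacity / Math.max(1L, activeUsers));
    }

    synchronized int getRecomputeCount() {
        return recomputeCount;
    }
}
```

In this sketch the read and the invalidation happen under the same lock, so a caller never sees a limit that is older than the last allocation or release.  A user who is stuck at the limit triggers no allocations, so repeated heartbeats return the cached value without recomputing.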

The asynchronous approach is very concerning to me.  We are essentially trading 
correctness for performance, and that seems to me like a reckless pursuit when 
there are still ways to improve performance without adding new race conditions 
and constraint violations.  Obviously moving the calculation outside of the 
allocate thread will show significant improvements in benchmarks, but those 
results don't show the cost of the scheduler violating its constraints.  IMHO 
that's a misleading result.
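
As a rough illustration of the kind of violation I mean, consider a deterministic toy model.  Everything here is hypothetical (the class, the method names, the unit-sized containers, and the capacity-divided-by-active-users formula); it only shows what happens when the allocator acts on a limit snapshot taken before the set of active users changed.

```java
/** Hypothetical illustration of allocating against a stale limit; not YARN code. */
class StaleLimitDemo {
    /** Placeholder limit formula (an assumption): capacity split across active users. */
    static int trueLimit(int queueCapacity, int activeUsers) {
        return queueCapacity / Math.max(1, activeUsers);
    }

    /**
     * Grants unit-sized containers while usage is below the limit the
     * allocator currently sees, and returns the resulting usage.
     */
    static int allocate(int used, int requests, int limitSeen) {
        for (int i = 0; i < requests && used < limitSeen; i++) {
            used++;
        }
        return used;
    }
}
```

With a capacity of 100 and a snapshot taken while one user was active, the allocator sees a limit of 100.  If a second user becomes active before the snapshot is refreshed, the true limit is 50, yet `allocate(40, 30, 100)` still takes the first user to 70.  Against the fresh limit, `allocate(40, 30, 50)` stops at 50.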

I also question the logic of relying on preemption and opportunistic containers 
to "solve" the constraint violation problems.  Neither of those features is 
free.  Preemption loses work, and opportunistic containers aren't guaranteed to 
be allocated in a timely manner (or could in turn be preempted).  In theory 
this should eventually converge to a more correct constraint value, but I would 
argue at a cost of allocation latency and lost work.

This feature is blocking user-limit-based in-queue preemptions which we are 
very eager to see.  I propose we go with a simple approach that is easy to 
implement and simple to prove correct.  Adding something that can violate 
the scheduler's constraints doesn't seem necessary to unblock the in-queue 
preemption work.  Let's get that work unblocked and we can continue to discuss 
asynchronous constraint violation approaches in parallel.

> Improve user-limit calculation in capacity scheduler
> ----------------------------------------------------
>
>                 Key: YARN-5889
>                 URL: https://issues.apache.org/jira/browse/YARN-5889
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: YARN-5889.v0.patch, YARN-5889.v1.patch, 
> YARN-5889.v2.patch
>
>
> Currently user-limit is computed during every heartbeat allocation cycle with 
> a write lock. To improve performance, this ticket focuses on moving the 
> user-limit calculation out of the heartbeat allocation flow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
