Craig Welch commented on YARN-1198:

I initially considered an approach like this one, but did not go in that 
direction for a couple of reasons.  One is that, to avoid introducing a 
calculation during the heartbeat, you end up iterating over all the users in 
the queue with every headroom calculation.  While this is generally less work 
than iterating over all of the applications in a queue, it may still be fairly 
significant in some usage patterns, and in the worst case (a different user 
for each application) it is exactly equivalent to what we are trying to avoid.  
The other is the "Resource required" value, which is application specific and 
included in the user limit calculation - the comment in the code indicates 
this

// ensures that jobs in queues
//   with miniscule capacity (< 1 slot) make progress

- I notice that updateClusterResource just provides ".none" for this value, so 
it is not being honored in all cases, but I'm concerned about breaking the 
case it is meant to handle by detaching it from the headroom calculation 
generally.  Handling this value as we do today requires an application 
specific calculation - hence placing it in the application path and handling 
it as I do in the .7 patch during the heartbeat, using an application specific 
value.  If we move to calculating it at the user level, then we would have to 
choose one "required" value from one of the user's applications to avoid 
iterating over them; otherwise we are back to iterating over all applications 
on each pass.  In a practical sense that might be fine, unless different 
applications for the same user pass significantly different values for 
"required".  We could use a "max" for that value, but then an unusually large 
value for "required" could be carried forward indefinitely (for as long as the 
user has active applications).  Or we could just use the last value provided 
for that user and accept that it changes the results a bit, possibly in an 
undesired way.
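To make the tradeoff concrete, here is a minimal sketch of the two per-user policies described above - carrying the max "required" forward versus keeping only the last reported value.  The class and method names are hypothetical, purely for illustration; they are not from any of the patches:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two per-user "required" policies discussed
// above; all names are illustrative, not from the patches under review.
public class UserRequiredTracker {
    // last "required" value reported by any of the user's applications
    private final Map<String, Integer> lastRequired = new HashMap<>();
    // running max of "required" across the user's active applications
    private final Map<String, Integer> maxRequired = new HashMap<>();

    // Called whenever one of the user's applications reports "required".
    public void report(String user, int required) {
        lastRequired.put(user, required);
        maxRequired.merge(user, required, Math::max);
    }

    // Policy A: a max can carry an unusually large value forward for as
    // long as the user has active applications.
    public int requiredUsingMax(String user) {
        return maxRequired.getOrDefault(user, 0);
    }

    // Policy B: the last value avoids that, but the result then depends
    // on which application happened to report most recently.
    public int requiredUsingLast(String user) {
        return lastRequired.getOrDefault(user, 0);
    }
}
```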

A couple of other points:

-re "we don't need write HeadroomProvider for each scheduler" - we already 
don't need one: the base implementation I've provided maintains the current 
behavior for the other schedulers, and it appears they may not require the 
same treatment, as they do not necessarily vary their headroom as dynamically, 
or in the interrelated way, that the capacity scheduler does.  The pattern I'm 
introducing here can be reused by them, but they would, in any case, require 
their own logic to effect this kind of update if they need it.

-re "As mentioned by Jason, currently, fair/capacity scheduler all support 
moving app between queues, we should recompute and change the reference after 
finished moving app" 
I take this to properly be a task to take on when providing support for moving 
between queues.  Not having the location in code at present where this will 
happen prevents me from really addressing it; it's not part of the current 
effort, and in any case this change is not making that any more difficult.  
(It may be making it easier... hard to be sure until we're ready to do it... 
but I am sure it is not making it more difficult.  The first time an 
application calls computeUserLimit... after it is moved, it will automatically 
pick up the proper configuration to provide headroom from then onward, with no 
other changes so far as I can see.  We could also effect this by simply 
setting the headroom provider during the move.)

Provider vs Reference - I went with the more general term because I'm not sure 
that in all cases it will be a simple reference / will have no logic of its 
own - Provider is a superset / more generic term :-)
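As a rough illustration of that distinction, here is a minimal sketch of the pattern, assuming simple int-valued headroom; all names here are my own, not the patch's actual API.  A provider may just hand back a stored value (preserving the base behavior) or carry logic of its own:

```java
// Minimal sketch of the provider pattern described above; the names are
// illustrative and not the exact API from the patch.
public class HeadroomSketch {
    // A provider may be a simple reference or may compute on demand.
    interface HeadroomProvider {
        int getHeadroom();
    }

    // Base behavior: return whatever was last set, preserving existing
    // semantics for schedulers that don't need dynamic recalculation.
    static class StaticHeadroomProvider implements HeadroomProvider {
        private volatile int headroom;
        void setHeadroom(int h) { headroom = h; }
        public int getHeadroom() { return headroom; }
    }

    // Capacity-scheduler-style behavior: recompute from current state
    // each time headroom is requested.
    static class ComputingHeadroomProvider implements HeadroomProvider {
        private volatile int userLimit;
        private volatile int userConsumed;
        void update(int limit, int consumed) {
            userLimit = limit;
            userConsumed = consumed;
        }
        public int getHeadroom() {
            return Math.max(userLimit - userConsumed, 0);
        }
    }
}
```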

-re the cost of the calculation - if you look through the code, it's factored 
such that everything refers to local members of a relatively small object 
graph.  Basically, it's just "doing a few member lookups and a little math (I 
know, you could say that about anything - but in this case, it really isn't 
very much)".  No significant data structures have to be accessed, and while 
it's hidden behind calls to Resources it really is just a bit of calculation...
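For reference, the calculation being described is roughly of this shape - a deliberate simplification using plain ints in place of the Resources calls, not the actual scheduler code:

```java
// Rough shape of the headroom math discussed above, using plain ints
// instead of Resources calls; a simplification, not the actual code.
public class HeadroomMath {
    // Headroom is roughly how far the user may grow before hitting the
    // smaller of the user limit and the queue's available capacity.
    static int headroom(int userLimit, int queueMaxAvailable, int userConsumed) {
        int limit = Math.min(userLimit, queueMaxAvailable);
        return Math.max(limit - userConsumed, 0);
    }
}
```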

That said, I can see benefits to avoiding some of the work being done in the 
heartbeat - the one hard limitation is the impact on how the "Resource 
required" value is handled, which is possibly not a significant tradeoff.  I 
also had some concurrency concerns: by moving this out to the heartbeat we are 
accessing some shared Resource values concurrently which are not accessed 
concurrently at present, and I ran into some concurrency issues with LeafQueue 
when making the change (all resolved, but they caused some alarm and required 
some workarounds).  There could be other latent concurrency issues there which 
would surface as corner cases, whereas if all of the calculation happens in 
the calculate... call in the leaf queue we are back to the earlier model and 
can be pretty well assured there are no new concurrency issues in play.  When 
looking at this user-centric approach, my intention was to attach the headroom 
value(s) to the User instance which is already present in a 
user(String)->User map in the LeafQueue - it's not necessary to introduce a 
new map, since we already have something holding User-level resource 
information which could be used consistently for the headroom calculation 
going forward.  I'll post a patch with that approach, and we can compare it 
against .7 and .4 and decide on a direction.
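Roughly, the idea is to hang the headroom off the existing per-user entry rather than introducing a parallel structure - something like this sketch, with hypothetical names standing in for the real LeafQueue/User classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of attaching headroom to the existing per-user
// entries in a LeafQueue-style map; names are illustrative only.
public class LeafQueueSketch {
    static class User {
        private volatile int consumed;  // user-level resource info already tracked here
        private volatile int headroom;  // headroom cached alongside it
        void setHeadroom(int h) { headroom = h; }
        int getHeadroom() { return headroom; }
    }

    // The existing user(String) -> User map; no new map is introduced.
    private final Map<String, User> users = new ConcurrentHashMap<>();

    User getUser(String name) {
        return users.computeIfAbsent(name, k -> new User());
    }

    // One update makes the new headroom visible to all of the user's
    // applications the next time they ask for it.
    void updateHeadroom(String name, int headroom) {
        getUser(name).setHeadroom(headroom);
    }
}
```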

> Capacity Scheduler headroom calculation does not work as expected
> -----------------------------------------------------------------
>                 Key: YARN-1198
>                 URL: https://issues.apache.org/jira/browse/YARN-1198
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
>            Assignee: Craig Welch
>         Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, 
> YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch
> Today headroom calculation (for the app) takes place only when
> * New node is added/removed from the cluster
> * New container is getting assigned to the application.
> However there are potentially a lot of situations which are not considered 
> in this calculation
> * If a container finishes then headroom for that application will change and 
> should be notified to the AM accordingly.
> * If a single user has submitted multiple applications (app1 and app2) to the 
> same queue then
> ** If app1's container finishes then not only app1's but also app2's AM 
> should be notified about the change in headroom.
> ** Similarly if a container is assigned to any applications app1/app2 then 
> both AM should be notified about their headroom.
> ** To simplify the whole communication process it is ideal to keep headroom 
> per User per LeafQueue so that everyone gets the same picture (apps belonging 
> to same user and submitted in same queue).
> * If a new user submits an application to the queue then all applications 
> submitted by all users in that queue should be notified of the headroom 
> change.
> * Also, today headroom is an absolute number (I think it should be 
> normalized, but then this is not going to be backward compatible..)
> * Also, when an admin refreshes the queue, headroom has to be updated.
> These are all potential bugs in the headroom calculation.
