Thomas Graves commented on YARN-3434:

I am not saying the child needs to know how the parent calculates the resource 
limit.  I am saying that the user limit, and whether the queue needs to 
unreserve to make another reservation, has nothing to do with the parent queue 
(i.e. it doesn't apply to the parent queue).  Remember, I don't need to store 
the user limit; I need to store whether it needs to unreserve and, if it does, 
how much it needs to unreserve.

When a node heartbeats, it goes through the regular assignments and updates the 
leafQueue clusterResources based on what the parent passes in. When a node is 
removed or added, it updates the resource limits. Neither of these applies to 
the calculation of whether it needs to unreserve or not.

Basically it comes down to this: is the information useful outside of the small 
window between when it is calculated and when it's needed in assignContainer()? 
My thought is no, and you said so yourself in the last bullet above.  We have 
been referring to the userLimit, and perhaps that is the problem.  I don't need 
to store the userLimit; I need to store whether it needs to unreserve and, if 
so, how much.  Therefore it fits better as a local transient variable rather 
than a globally stored one.  If you store just the userLimit, then you need to 
recalculate things, which I'm trying to avoid.

I understand why we are storing the current information in ResourceLimits: it 
has to do with headroom and parent limits, and it is recalculated at various 
points. But the current implementation in canAssignToUser doesn't use headroom 
at all, and whether we needed to unreserve on the last call to assignContainers 
doesn't affect the headroom calculation.

Again, basically all we would be doing is placing one or more extra global 
variables in the ResourceLimits class just to pass them down a couple of 
functions. That, to me, is a parameter.  Now, if we had multiple things needing 
this or updating it, then to me it would fit better in ResourceLimits.
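
To make the "local value passed as a parameter" idea concrete, here is a 
minimal sketch. It is not the CapacityScheduler code and not the attached 
patch: the class UnreserveSketch, the UserLimitCheck type, the method names, 
and the memory-only long values in MB are all hypothetical simplifications. It 
assumes the user's used amount includes that user's reservations.

```java
// Hypothetical, simplified model: memory only, in MB. Not the real
// CapacityScheduler classes or signatures.
class UnreserveSketch {

    /** Short-lived result of the user-limit check; never stored on the queue. */
    static final class UserLimitCheck {
        final boolean canAssign;
        final long amountToUnreserveMb;

        UserLimitCheck(boolean canAssign, long amountToUnreserveMb) {
            this.canAssign = canAssign;
            this.amountToUnreserveMb = amountToUnreserveMb;
        }
    }

    /**
     * Decide whether the user may be assigned more, and how much must be
     * unreserved first. usedMb is assumed to include reservedMb.
     */
    static UserLimitCheck checkUserLimit(long usedMb, long reservedMb, long limitMb) {
        if (usedMb <= limitMb) {
            return new UserLimitCheck(true, 0L);              // under the limit
        }
        if (usedMb - reservedMb <= limitMb) {
            // Over the limit only because of reservations: allowed, provided
            // the excess is unreserved before a new reservation is made.
            return new UserLimitCheck(true, usedMb - limitMb);
        }
        return new UserLimitCheck(false, 0L);                 // over even without reservations
    }

    /** Receives the amount as a plain parameter; returns the reservation left over. */
    static long assignContainer(long reservedMb, long amountToUnreserveMb) {
        return reservedMb - amountToUnreserveMb;
    }

    /** The value lives only between the check and the call that needs it. */
    static long assignContainers(long usedMb, long reservedMb, long limitMb) {
        UserLimitCheck check = checkUserLimit(usedMb, reservedMb, limitMb);
        if (!check.canAssign) {
            return -1L;                                       // refused
        }
        return assignContainer(reservedMb, check.amountToUnreserveMb);
    }
}
```

In this sketch nothing outside assignContainers ever sees the amount to 
unreserve, which is the point being argued. A real scheduler would unreserve 
whole containers rather than an exact number of MB.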

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --------------------------------------------------------------------------------------
>                 Key: YARN-3434
>                 URL: https://issues.apache.org/jira/browse/YARN-3434
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: YARN-3434.patch
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, each 8G, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to allow the userlimit to be surpassed.
