Thomas Graves commented on YARN-3434:

The code you mention is in the else part of that check, where it would do a 
reservation.  The situation I'm talking about actually allocates a container 
rather than reserving one.  I'll try to explain better:

The application asks for lots of containers. It acquires some containers, then 
it reserves some. At this point it hits its normal user limit, which in my 
example = capacity.  It hasn't hit the max amount it can allocate or reserve 
(shouldAllocOrReserveNewContainer()).  The next node that heartbeats in isn't 
yet reserved and has enough space to place a container on.  It first checks 
assignContainers -> canAssignToThisQueue.  That passes since we haven't hit 
max capacity. Then it checks assignContainers -> canAssignToUser. That passes, 
but only because used - reserved < the user limit.  This allows it to continue 
down into assignContainer.  In assignContainer the node has available space 
and we haven't hit the shouldAllocOrReserveNewContainer() limit. 
reservationsContinueLooking is on and labels are empty, so it does the check:

if (!shouldAllocOrReserveNewContainer
            || Resources.greaterThan(resourceCalculator, clusterResource,
                minimumUnreservedResource, Resources.none()))

As I said before, it's allowed to allocate or reserve, so the first part of 
that test doesn't trigger.  It also hasn't met its maximum capacity yet 
(capacity = 30% and max capacity = 100%), so minimumUnreservedResource is none 
and the second part doesn't kick in either. As a result it doesn't go into the 
block that calls findNodeToUnreserve().  It then goes ahead and allocates when 
it should have needed to unreserve first.  Basically we need to also do the 
user limit check again and force it to do the findNodeToUnreserve().
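
To make the sequence above concrete, here is a toy model of a single node 
heartbeat. It is only a sketch, not the LeafQueue code: memory is a plain int 
in GB, the class and the heartbeat() and recheckUserLimit names are 
hypothetical, and the numbers (22 allocated, 8 reserved, user limit 30) are 
made up for illustration. It assumes the user's "used" figure includes 
reserved memory, as the canAssignToUser step above implies. The extra check 
is one possible reading of the suggestion above, not necessarily what the 
attached patch does.

```java
// Toy model only: every name here is hypothetical, not the real scheduler code.
class UserLimitSketch {

    /** canAssignToUser-style check: reserved memory is discounted from usage. */
    static boolean canAssignToUser(int used, int reserved, int userLimit) {
        return used - reserved < userLimit;
    }

    /**
     * Simulates one node heartbeat for a request of `request` GB and returns
     * the user's total usage (allocated + reserved) afterwards.
     */
    static int heartbeat(int allocated, int reserved, int request, int userLimit,
                         int minimumUnreservedResource, boolean recheckUserLimit) {
        int used = allocated + reserved;
        if (!canAssignToUser(used, reserved, userLimit)) {
            return used; // blocked by the user limit, nothing changes
        }

        // Assumed true here: the app may still allocate or reserve.
        boolean shouldAllocOrReserveNewContainer = true;

        // Mirrors the quoted condition: only unreserve if we may not
        // allocate/reserve any more, or if max capacity would be exceeded.
        boolean mustUnreserve = !shouldAllocOrReserveNewContainer
                || minimumUnreservedResource > 0;

        // Hypothetical extra check: re-test the user limit with reservations
        // counted, and force an unreserve if the allocation would exceed it.
        if (recheckUserLimit && used + request > userLimit) {
            mustUnreserve = true;
        }

        if (mustUnreserve) {
            if (reserved < request) {
                return used; // nothing suitable to unreserve, skip this node
            }
            reserved -= request; // stand-in for findNodeToUnreserve()
        }

        allocated += request;
        return allocated + reserved;
    }

    public static void main(String[] args) {
        // User limit = 30; 22 allocated + 8 reserved puts the user at the limit.
        System.out.println(heartbeat(22, 8, 8, 30, 0, false)); // 38: over the limit
        System.out.println(heartbeat(22, 8, 8, 30, 0, true));  // 30: unreserved first
    }
}
```

Without the extra check the user ends up at 38 against a limit of 30, because 
the existing reservation is never released. With it, the reservation is 
swapped for the new allocation and usage stays at 30.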

> Interaction between reservations and userlimit can result in significant ULF 
> violation
> --------------------------------------------------------------------------------------
>                 Key: YARN-3434
>                 URL: https://issues.apache.org/jira/browse/YARN-3434
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: YARN-3434.patch
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 
> containers, 8G each, within about 5 seconds. I think this allowed the 
> logic in assignToUser() to let the user limit be surpassed.

This message was sent by Atlassian JIRA