Jian He commented on YARN-1857:

Thanks [~airbots] and [~cwelch], the patch looks good overall. A few comments:
- Indentation of the last line seems incorrect.
    Resource headroom =
      Resources.min(resourceCalculator, clusterResource,
            Resources.min(resourceCalculator, clusterResource, 
                userLimit, queueMaxCap), 
Resources.subtract(queueMaxCap, usedResources));
- Test case 2: could you check app2 headRoom as well?
- Test case 3: could you check app_1 headRoom as well?
- Could you explain why, in test case 4, app4 still has 5GB headRoom? See 
{{assertEquals(5*GB, app_4.getHeadroom().getMemory());}}

> CapacityScheduler headroom doesn't account for other AM's running
> -----------------------------------------------------------------
>                 Key: YARN-1857
>                 URL: https://issues.apache.org/jira/browse/YARN-1857
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Assignee: Chen He
>            Priority: Critical
>         Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, 
> YARN-1857.patch, YARN-1857.patch, YARN-1857.patch
> It's possible to get an application to hang forever (or a long time) in a 
> cluster with multiple users.  The reason why is that the headroom sent to the 
> application is based on the user limit but it doesn't account for other 
> Application masters using space in that queue.  So the headroom (user limit - 
> user consumed) can be > 0 even though the cluster is 100% full because the 
> other space is being used by application masters from other users.  
> For instance if you have a cluster with 1 queue, user limit is 100%, you have 
> multiple users submitting applications.  One very large application by user 1 
> starts up, runs most of its maps and starts running reducers. Other users try 
> to start applications and get their application masters started but not 
> tasks.  The very large application then gets to the point where it has 
> consumed the rest of the cluster resources with all reduces.  But at this 
> point it needs to still finish a few maps.  The headroom being sent to this 
> application is only based on the user limit (which is 100% of the cluster 
> capacity); it's using, let's say, 95% of the cluster for reduces, and the other 5% 
> is being used by other users running application masters.  The MRAppMaster 
> thinks it still has 5% so it doesn't know that it should kill a reduce in 
> order to run a map.  
> This can happen in other scenarios also.  Generally in a large cluster with 
> multiple queues this shouldn't cause a hang forever but it could cause the 
> application to take much longer.
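
The scenario in the description can be worked through with plain numbers. The sketch below is illustrative only: it uses longs (percent of the cluster) instead of YARN's Resource objects, and the class and method names (HeadroomSketch, userLimitHeadroom, queueAwareHeadroom) are hypothetical, not part of the patch. The queue-aware variant is one possible way to cap headroom by what is left in the queue; it is not claimed to be the exact formula under review.

```java
public class HeadroomSketch {

    // Headroom as described in the issue: user limit minus what the user consumes.
    static long userLimitHeadroom(long userLimit, long userConsumed) {
        return Math.max(0, userLimit - userConsumed);
    }

    // Hypothetical queue-aware variant (an assumption for illustration):
    // additionally cap the headroom by the space still free in the queue.
    static long queueAwareHeadroom(long userLimit, long userConsumed,
                                   long queueMaxCap, long queueUsed) {
        long userRoom = Math.max(0, userLimit - userConsumed);
        long queueRoom = Math.max(0, queueMaxCap - queueUsed);
        return Math.min(userRoom, queueRoom);
    }

    public static void main(String[] args) {
        long cluster = 100;        // 100 units = 100% of the cluster, single queue
        long userLimit = cluster;  // user limit is 100%
        long user1Used = 95;       // user 1's reducers
        long otherAms = 5;         // application masters of other users
        long queueUsed = user1Used + otherAms;

        // User-limit-only headroom still reports 5 although the cluster is full.
        System.out.println(userLimitHeadroom(userLimit, user1Used));

        // The queue-aware variant reports 0, so the AM would know nothing is free.
        System.out.println(queueAwareHeadroom(userLimit, user1Used, cluster, queueUsed));
    }
}
```

With these numbers the first call prints 5 and the second prints 0, which is the gap that leaves the MRAppMaster waiting instead of killing a reduce.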
