[
https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kumar Vavilapalli updated YARN-1857:
------------------------------------------
Issue Type: Sub-task (was: Bug)
Parent: YARN-1198
> CapacityScheduler headroom doesn't account for other AM's running
> -----------------------------------------------------------------
>
> Key: YARN-1857
> URL: https://issues.apache.org/jira/browse/YARN-1857
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacityscheduler
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
>
> It's possible for an application to hang forever (or for a long time) in a
> cluster with multiple users. The headroom sent to the application is based
> on the user limit, but it doesn't account for other application masters
> using space in that queue. So the headroom (user limit - user consumed) can
> be > 0 even though the cluster is 100% full, because the remaining space is
> being used by application masters from other users.
> For instance, suppose you have a cluster with 1 queue, the user limit is
> 100%, and multiple users are submitting applications. One very large
> application by user 1 starts up, runs most of its maps, and starts running
> reducers. Other users try to start applications and get their application
> masters started but not their tasks. The very large application then gets
> to the point where it has consumed the rest of the cluster resources with
> reduces, but it still needs to finish a few maps. The headroom being sent
> to this application is based only on the user limit (which is 100% of the
> cluster capacity). It is using, let's say, 95% of the cluster for reduces,
> and the other 5% is being used by other users running application masters.
> The MRAppMaster thinks it still has 5%, so it doesn't know that it should
> kill a reduce in order to run a map.
> This can happen in other scenarios as well. Generally, in a large cluster
> with multiple queues this shouldn't cause a permanent hang, but it could
> cause the application to take much longer.
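
To make the arithmetic in the description concrete, here is a minimal sketch
of the two headroom calculations, using the 95% / 5% figures from the example.
The class and method names are hypothetical and are not CapacityScheduler
code. Capping the headroom by the free space in the queue is only one possible
way to account for other users' application masters; it is an assumption made
for illustration, not necessarily the fix applied for this issue. All values
are percentages of cluster capacity.

```java
// Illustrative sketch only: hypothetical names, not the CapacityScheduler API.
final class HeadroomSketch {

    // Headroom as described in the report: user limit - user consumed.
    static int userLimitHeadroom(int userLimit, int userConsumed) {
        return Math.max(0, userLimit - userConsumed);
    }

    // Assumed alternative: additionally cap the headroom by what is
    // actually free in the queue, so other users' AMs are accounted for.
    static int cappedHeadroom(int userLimit, int userConsumed,
                              int queueCapacity, int queueUsed) {
        int queueFree = Math.max(0, queueCapacity - queueUsed);
        return Math.min(userLimitHeadroom(userLimit, userConsumed), queueFree);
    }

    public static void main(String[] args) {
        int queueCapacity = 100; // single queue spanning the whole cluster
        int userLimit = 100;     // user limit is 100% of the queue
        int user1Used = 95;      // the large application's reduces
        int otherAmsUsed = 5;    // application masters of other users
        int queueUsed = user1Used + otherAmsUsed;

        // The report's formula still advertises 5% to the large application.
        System.out.println("user-limit headroom: "
                + userLimitHeadroom(userLimit, user1Used));
        // With the cap, the headroom is 0, since the queue is full.
        System.out.println("capped headroom: "
                + cappedHeadroom(userLimit, user1Used, queueCapacity, queueUsed));
    }
}
```

With a reported headroom of 0 rather than 5, the MRAppMaster in the example
would have the signal it needs to preempt a reduce and schedule the remaining
maps.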