[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650572#comment-14650572 ]
Carlo Curino commented on YARN-4003:
------------------------------------

[~leftnoteasy], this sounds like a good proposal to "tighten" this limit a bit more.

Talking with [~atumanov], however, we spotted a possible corner case I want your opinion on. Consider a cluster of 1000 machines and a PlanQueue with 50% capacity. We might have the following set of reservations: R1 and R2, both of size 250 at a certain time t_0, and R3, which has size 0 at t_0 (and will grow at some t_i > t_0).

Formally, the capacity of the PlanQueue (500 containers) is exhausted by R1 and R2, and R3 has capacity=0, so your math would yield an amLimit of 0 for R3 (i.e., no app can be started). However, R1 and R2 might be using only a fraction of their reserved capacity, and we might thus waste some resources. In this scenario, I would probably prefer R3 to get started opportunistically (and if the demand of R1 and R2 does not spike until t_i, when R3's capacity grows to >0, we are golden). We could clearly construct other scenarios in which letting the AM start only means we need to preempt it once R1 and R2 spike.

This is a balancing act of "work preservation" vs "guaranteed execution". I am ok to resolve it in either direction, what's your vote? (Anyone else with opinions on this?)

> ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-4003
>                 URL: https://issues.apache.org/jira/browse/YARN-4003
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Carlo Curino
>         Attachments: YARN-4003.patch
>
>
> The inherited behavior from LeafQueue (limit AM % based on capacity) is not a
> good fit for ReservationQueue (that have highly dynamic capacity).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
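
To make the corner case in the comment concrete, here is a minimal sketch of the arithmetic in Python. It is not YARN code: the function names, the 10% AM share, the usage figures for R1 and R2, and the "opportunistic" rule (computing the limit from capacity the PlanQueue is not currently using) are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not the YARN ReservationQueue API.

def am_limit_from_capacity(reservation_capacity, am_percent):
    """AM limit derived strictly from the reservation's current capacity."""
    return reservation_capacity * am_percent // 100


def am_limit_opportunistic(reservation_capacity, plan_capacity, plan_used, am_percent):
    """AM limit that also counts capacity the PlanQueue is not using right now.

    Assumption: idle capacity of sibling reservations may be borrowed, at the
    risk of preemption if their demand spikes.
    """
    idle = max(plan_capacity - plan_used, 0)
    return max(reservation_capacity, idle) * am_percent // 100


cluster = 1000
plan_capacity = cluster * 50 // 100          # PlanQueue of 50% -> 500 containers
reserved = {"R1": 250, "R2": 250, "R3": 0}   # reserved capacities at t_0
used = {"R1": 100, "R2": 50, "R3": 0}        # assumed actual usage at t_0
am_percent = 10                              # assumed AM share, in percent

strict = am_limit_from_capacity(reserved["R3"], am_percent)
opportunistic = am_limit_opportunistic(
    reserved["R3"], plan_capacity, sum(used.values()), am_percent
)
print(strict)         # 0: no AM can start in R3
print(opportunistic)  # 35
```

With these assumed numbers the strict rule blocks R3 entirely, while the opportunistic rule would admit AMs against the 350 idle containers. Those AMs are exactly the work that would have to be preempted if R1 and R2 spike before t_i.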