[ https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rand updated YARN-6960: ------------------------------ Attachment: YARN-6960.001.patch > definition of active queue allows idle long-running apps to distort fair > shares > ------------------------------------------------------------------------------- > > Key: YARN-6960 > URL: https://issues.apache.org/jira/browse/YARN-6960 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.8.1, 3.0.0-alpha4 > Reporter: Steven Rand > Assignee: Steven Rand > Attachments: YARN-6960.001.patch > > > YARN-2026 introduced the notion of only considering active queues when > computing the fair share of each queue. The definition of an active queue is > a queue with at least one runnable app: > {code} > public boolean isActive() { > return getNumRunnableApps() > 0; > } > {code} > One case that this definition of activity doesn't account for is that of > long-running applications that scale dynamically. Such an application might > request many containers when jobs are running, but scale down to very few > containers, or only the AM container, when no jobs are running. > Even when such an application has scaled down to a negligible amount of > demand and utilization, the queue that it's in is still considered to be > active, which defeats the purpose of YARN-2026. For example, consider this > scenario: > 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of > which have the same weight. > 2. Queues {{root.a}} and {{root.b}} contain long-running applications that > currently have only one container each (the AM). > 3. An application in queue {{root.c}} starts, and uses the whole cluster > except for the small amount in use by {{root.a}} and {{root.b}}. An > application in {{root.d}} starts, and has a high enough demand to be able to > use half of the cluster. Because all four queues are active, the app in > {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the > cluster's resources, while the app in {{root.c}} keeps about 75%. > Ideally in this example, the app in {{root.d}} would be able to preempt the > app in {{root.c}} up to 50% of the cluster, which would be possible if the > idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be > considered active. > One way to address this is to update the definition of an active queue to be > a queue containing 1 or more non-AM containers. This way if all apps in a > queue scale down to only the AM, other queues' fair shares aren't affected. > The benefit of this approach is that it's quite simple. The downside is that > it doesn't account for apps that are idle and using almost no resources, but > still have at least one non-AM container. > There are a couple of other options that seem plausible to me, but they're > much more complicated, and it seems to me that this proposal makes good > progress while adding minimal extra complexity. > Does this seem like a reasonable change? I'm certainly open to better ideas > as well. > Thanks, > Steve -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org