Steven Rand created YARN-6960:
---------------------------------

             Summary: definition of active queue allows idle long-running apps 
to distort fair shares
                 Key: YARN-6960
                 URL: https://issues.apache.org/jira/browse/YARN-6960
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
    Affects Versions: 3.0.0-alpha4, 2.8.1
            Reporter: Steven Rand
            Assignee: Steven Rand


YARN-2026 introduced the notion of only considering active queues when 
computing the fair share of each queue. The definition of an active queue is a 
queue with at least one runnable app:

{code}
  public boolean isActive() {
    return getNumRunnableApps() > 0;
  }
{code}

One case that this definition of activity doesn't account for is that of 
long-running applications that scale dynamically. Such an application might 
request many containers when jobs are running, but scale down to very few 
containers, or only the AM container, when no jobs are running.

Even when such an application has scaled down to a negligible amount of demand 
and utilization, the queue that it's in is still considered to be active, which 
defeats the purpose of YARN-2026. For example, consider this scenario:

1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
which have the same weight.
2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
currently have only one container each (the AM).
3. An application in queue {{root.c}} starts, and uses the whole cluster except 
for the small amount in use by {{root.a}} and {{root.b}}. An application in 
{{root.d}} starts, and has a high enough demand to be able to use half of the 
cluster. Because all four queues are active, the app in {{root.d}} can only 
preempt the app in {{root.c}} up to roughly 25% of the cluster's resources, 
while the app in {{root.c}} keeps about 75%.

Ideally in this example, the app in {{root.d}} would be able to preempt the app 
in {{root.c}} up to 50% of the cluster, which would be possible if the idle 
apps in {{root.a}} and {{root.b}} didn't cause those queues to be considered 
active.

One way to address this is to update the definition of an active queue to be a 
queue containing 1 or more non-AM containers. This way if all apps in a queue 
scale down to only the AM, other queues' fair shares aren't affected.

The benefit of this approach is that it's quite simple. The downside is that it 
doesn't account for apps that are idle and using almost no resources, but still 
have at least one non-AM container.

There are a couple of other options that seem plausible to me, but they're much 
more complicated, and it seems to me that this proposal makes good progress 
while adding minimal extra complexity.

Does this seem like a reasonable change? I'm certainly open to better ideas as 
well.

Thanks,
Steve



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org

Reply via email to