[jira] [Commented] (YARN-5545) App submit failure on queue with label when default queue partition capacity is zero

Sunil G (JIRA) Wed, 14 Sep 2016 22:21:07 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492418#comment-15492418
 ]


Sunil G commented on YARN-5545:
-------------------------------

Thank you very much [~jlowe] for pitching in and sharing thoughts. Makes sense 
to me overall.

bq.if we go down this route then I think we should have a separate top-level 
config that, when set, specifies the default max-apps per queue explicitly 
rather than having them try to derive it based on relative capacities. We can 
then debate whether that also overrides the system-wide setting or if we still 
respect the system-wide limit.

IIUC, existing {{yarn.scheduler.capacity.maximum-applications}} can be used for 
system level max-limit for apps, and proposing new config like 
{{yarn.scheduler.capacity.global.queue-level-maximum-applications}}. This could 
be configured to set max-apps per queue level  in cluster level (queue won’t 
override this). So if I set this config as 10k, then any queue in best case 
could atmost submit 10k apps. And this will also work along with system-wide 
app limit. Hence if we configure system-wide app limit as 50k, and assuming we 
have 10queues (10k each limit), we will not end up in having 100k apps in 
cluster. Rather we will hit system-wide limit of 50k.

As more queues are added to system, admin can decrease global queue 
max-app-limit for better fine tuning if needed. If we are tending to use global 
queue max-app-limit as a relaxed boundary, then strict actions (reject an app) 
can be done based on system-wide limit. But if we are configuring this limit 
more judiciously, we can think of making same queue max-app-limit also as 
strict limit to reject apps for a queue.
I see only one problem for this now. If  we are not making {{Q * X ~ Y}} (where 
Q is number of queues, X is global per-queue limit and Y is system-wide max-app 
limit) as a strict rule, then we have 2 possibilities {{Q * X > Y}} and {{Q * X 
< Y}}.  I think mostly admin prefer to use former approach, where system-wide 
limit will be stricter and a relaxed limit for per-queue limit. But if we use 
latter, then we may reject apps even though system-wide limit is still not met. 
This may or may not be fine. I think with more discussion we can come to a 
common consensus here. 

> App submit failure on queue with label when default queue partition capacity 
> is zero
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-5545
>                 URL: https://issues.apache.org/jira/browse/YARN-5545
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5545.0001.patch, YARN-5545.0002.patch, 
> YARN-5545.0003.patch, capacity-scheduler.xml
>
>
> Configure capacity scheduler 
> yarn.scheduler.capacity.root.default.capacity=0
> yarn.scheduler.capacity.root.queue1.accessible-node-labels.labelx.capacity=50
> yarn.scheduler.capacity.root.default.accessible-node-labels.labelx.capacity=50
> Submit application as below
> ./yarn jar 
> ../share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha2-SNAPSHOT-tests.jar
>  sleep -Dmapreduce.job.node-label-expression=labelx 
> -Dmapreduce.job.queuename=default -m 1 -r 1 -mt 10000000 -rt 1
> {noformat}
> 2016-08-21 18:21:31,375 INFO mapreduce.JobSubmitter: Cleaning up the staging 
> area /tmp/hadoop-yarn/staging/root/.staging/job_1471670113386_0001
> java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed 
> to submit application_1471670113386_0001 to YARN : 
> org.apache.hadoop.security.AccessControlException: Queue root.default already 
> has 0 applications, cannot accept submission of application: 
> application_1471670113386_0001
>       at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:316)
>       at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:255)
>       at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>       at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1790)
>       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>       at org.apache.hadoop.mapreduce.SleepJob.run(SleepJob.java:273)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>       at org.apache.hadoop.mapreduce.SleepJob.main(SleepJob.java:194)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>       at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>       at 
> org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:136)
>       at 
> org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:144)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit 
> application_1471670113386_0001 to YARN : 
> org.apache.hadoop.security.AccessControlException: Queue root.default already 
> has 0 applications, cannot accept submission of application: 
> application_1471670113386_0001
>       at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:286)
>       at 
> org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:296)
>       at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:301)
>       ... 25 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-5545) App submit failure on queue with label when default queue partition capacity is zero

Reply via email to