[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

Haibo Chen (JIRA) Mon, 14 May 2018 16:51:25 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475058#comment-16475058
 ]


Haibo Chen commented on YARN-8248:
----------------------------------

{quote}as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null 
value for the AM requests,
{quote}
Good catch! It does indeed return null if the AM is unmanaged. But I am not 
sure how the debug message helps diagnose this issue. I'd prefer we remove the 
debug message.
{quote} Is this explanation makes it cleaner?
{quote}
Yes. That makes sense. Comments would be very helpful in this case. We could 
also maybe reverse the order of the two conditions. The current diagnostic 
message seems good to me now that I understand what the condition means.
{quote} So in my understanding, it can happen that in {{addApplication()}} the 
app was not rejected, for example AM does not request vCores and we have 0 
vCores configure as max resources, but for a map container, 1 vCores is 
requested.
{quote}
Indeed, that can happen with custom resource types. In 
FairScheduler.allocate(), instead of rejecting an application if any request 
is rejected, we can just filter out the ones that should be rejected by 
removing them from the ask list (with a warning log) and proceed. Rejecting an 
application after it has started running (FairScheduler.allocate() is called 
remotely by the AM) seems counter-intuitive. I think we can signal the AM by 
throwing a SchedulerInvalidResoureRequestException, which is propagated to the 
AM. What do you think?
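To make the filtering idea concrete, here is a minimal sketch of it, not the actual patch. Everything in it is hypothetical: the class AskFilterSketch, the methods filterAsk(), exceedsZeroMax() and res(), and the use of plain maps in place of YARN's resource types. It also assumes that a resource type missing from the queue's maximum is unlimited. It only separates the requests; how the AM is notified about the dropped ones is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the scheduler side; none of these names are YARN APIs.
final class AskFilterSketch {

    /** Result of filtering: requests that stay in the ask list and requests dropped from it. */
    static final class Filtered {
        final List<Map<String, Long>> accepted = new ArrayList<>();
        final List<Map<String, Long>> rejected = new ArrayList<>();
    }

    /** Helper that builds a resource map with memory (MB) and vcores. */
    static Map<String, Long> res(long memoryMb, long vcores) {
        Map<String, Long> r = new LinkedHashMap<>();
        r.put("memory-mb", memoryMb);
        r.put("vcores", vcores);
        return r;
    }

    /** True if the request needs a resource type whose queue maximum is exactly 0. */
    static boolean exceedsZeroMax(Map<String, Long> request, Map<String, Long> queueMax) {
        for (Map.Entry<String, Long> e : request.entrySet()) {
            // Assumption: a type absent from queueMax is treated as unlimited.
            long max = queueMax.getOrDefault(e.getKey(), Long.MAX_VALUE);
            if (max == 0 && e.getValue() > 0) {
                return true;
            }
        }
        return false;
    }

    /** Drops unsatisfiable requests with a warning instead of failing the whole application. */
    static Filtered filterAsk(List<Map<String, Long>> ask, Map<String, Long> queueMax) {
        Filtered result = new Filtered();
        for (Map<String, Long> request : ask) {
            if (exceedsZeroMax(request, queueMax)) {
                System.err.println("WARN: dropping request " + request
                    + " because the queue maximum is " + queueMax);
                result.rejected.add(request);
            } else {
                result.accepted.add(request);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> queueMax = res(90000, 0);   // as in the sample_queue config
        List<Map<String, Long>> ask = new ArrayList<>();
        ask.add(res(1536, 0));                        // needs no vcores: kept
        ask.add(res(1024, 1));                        // needs 1 vcore: dropped
        Filtered f = filterAsk(ask, queueMax);
        System.out.println("accepted=" + f.accepted + " rejected=" + f.rejected);
    }
}
```

Returning the rejected requests separately keeps open the choice of how to report them, for example by throwing an exception to the caller once the remaining requests have been processed.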
{quote}About the uncovered unit test: Good point and I was thinking about that 
if we can reject an application only if the AM request is greater than 0 and we 
have 0 configured as max resource or simply in any case where the requested 
resource is greater than max resource, regardless if it is 0 or not.
{quote}
Never mind comment 4). That was based on my previous misunderstanding. If the 
AM request is larger than the non-zero max-resource (steady fair share), we 
should not reject, because the queue may get an instantaneous fair share that 
is large enough. That's not related to this patch.
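As a rough illustration of the submission-time rule described above, the following sketch rejects only when the AM asks for a resource type that the queue caps at exactly 0, and lets through both a null AM request (the unmanaged AM case) and a request that merely exceeds a non-zero maximum. The class SubmissionCheckSketch, its Res type, and shouldReject() are hypothetical names for this example, not code from the patch.

```java
// Hypothetical sketch; class, method, and field names are illustrative, not YARN APIs.
final class SubmissionCheckSketch {

    /** Minimal two-dimensional resource: memory in MB and virtual cores. */
    static final class Res {
        final long memoryMb;
        final long vcores;

        Res(long memoryMb, long vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    /**
     * Returns true only when the AM needs a resource type whose queue maximum is 0.
     * A null request (unmanaged AM) is never rejected here, and neither is a
     * request that is larger than a non-zero maximum.
     */
    static boolean shouldReject(Res amRequest, Res queueMax) {
        if (amRequest == null) {
            return false;
        }
        boolean memoryImpossible = queueMax.memoryMb == 0 && amRequest.memoryMb > 0;
        boolean vcoresImpossible = queueMax.vcores == 0 && amRequest.vcores > 0;
        return memoryImpossible || vcoresImpossible;
    }

    public static void main(String[] args) {
        Res queueMax = new Res(90000, 0);  // maxResources of sample_queue in the report
        System.out.println(shouldReject(new Res(1536, 1), queueMax));    // true
        System.out.println(shouldReject(new Res(1536, 0), queueMax));    // false
        System.out.println(shouldReject(new Res(200000, 0), queueMax));  // false
        System.out.println(shouldReject(null, queueMax));                // false
    }
}
```

The first call mirrors the reported case: an AM request of <memory:1536, vCores:1> against a queue whose maximum has 0 vcores.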

 

Let me know if something does not make sense.

> Job hangs when a queue is specified and the maxResources of the queue cannot 
> satisfy the AM resource request
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8248
>                 URL: https://issues.apache.org/jira/browse/YARN-8248
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, yarn
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch, 
> YARN-8248-006.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other).
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  <queue name="sample_queue">
>     <minResources>10000 mb,0vcores</minResources>
>     <maxResources>90000 mb,0vcores</maxResources>
>     <maxRunningApps>50</maxRunningApps>
>     <maxAMShare>-1.0f</maxAMShare>
>     <weight>2.0</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>   </queue>
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request: <memory:1536, vCores:1> exceeds current 
> queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
