[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

2018-05-14 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475058#comment-16475058
 ] 

Haibo Chen commented on YARN-8248:
--

{quote}as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null 
value for the AM requests,
{quote}
Good catch! It does indeed return null if the AM is unmanaged. But I am not 
sure how the debug message helps diagnose this issue, so I'd prefer we remove 
it.
{quote}Does this explanation make it clearer?
{quote}
Yes, that makes sense. Comments would be very helpful in this case. We could 
also maybe reverse the order of the two conditions. The current diagnostic 
message seems good to me now that I understand what the condition means.
{quote}So in my understanding, it can happen that in {{addApplication()}} the 
app was not rejected (for example, the AM does not request vCores and we have 
0 vCores configured as max resources), but then 1 vCore is requested for a map 
container.
{quote}
Indeed, that can happen with custom resource types. In FairScheduler.allocate(), 
instead of rejecting an application if any request is rejected, we can just 
filter out the asks that should be rejected by removing them from the ask list 
(with a warning log) and proceed. Rejecting an application after it has started 
running (FairScheduler.allocate() is called remotely by the AM) seems 
counter-intuitive. I think we can signal the AM by throwing a 
SchedulerInvalidResoureRequestException, which is propagated to the AM. What do 
you think?
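
For the signalling variant, I am picturing roughly the following (a sketch 
only, not the final shape of the patch; the loop placement inside 
{{allocate()}} and the use of {{getMaxShare()}} are assumptions on my part):
{code:java}
// Sketch: validate each ask against the queue's max share inside
// FairScheduler.allocate(), and signal the AM instead of rejecting an
// application that is already running.
for (ResourceRequest rr : ask) {
  if (!Resources.fitsIn(rr.getCapability(), queue.getMaxShare())) {
    String msg = "Resource request " + rr.getCapability()
        + " can never be satisfied by queue " + queue.getName()
        + " with max share " + queue.getMaxShare();
    LOG.warn(msg);
    // Propagated back to the AM through the allocate() call.
    throw new SchedulerInvalidResoureRequestException(msg);
  }
}
{code}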
{quote}About the uncovered unit test: Good point. I was thinking about whether 
we can reject an application only if the AM request is greater than 0 and we 
have 0 configured as the max resource, or simply in any case where the 
requested resource is greater than the max resource, regardless of whether it 
is 0 or not.
{quote}
Never mind comment 4); that was based on my previous misunderstanding. If the 
AM request is larger than the non-zero max resource (steady fair share), we 
should not reject, because the queue may get an instantaneous fair share that 
is large enough. That's not related to this patch.

 

Let me know if something does not make sense.

> Job hangs when a queue is specified and the maxResources of the queue cannot 
> satisfy the AM resource request
> 
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch, 
> YARN-8248-006.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other)
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
> <queue name="sample_queue">
>   <minResources>1 mb,0vcores</minResources>
>   <maxResources>9 mb,0vcores</maxResources>
>   <maxRunningApps>50</maxRunningApps>
>   <maxAMShare>-1.0f</maxAMShare>
>   <weight>2.0</weight>
>   <schedulingPolicy>fair</schedulingPolicy>
> </queue>
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> [Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request:  exceeds current 
> queue or its parents maximum resource allowed).{code}





[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

2018-05-13 Thread Szilard Nemeth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473426#comment-16473426
 ] 

Szilard Nemeth commented on YARN-8248:
--

Hi @haibo!

Thanks for your comments!

1. I'm fine with removing the first null check:
{code:java}
if (rmApp == null || rmApp.getAMResourceRequests() == null) {
  LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
}
{code}
but as {{RMAppManager.validateAndCreateResourceRequest()}} can return a null 
value for the AM requests, I would leave the second null check that is just 
before the loop on amRequests:
{code:java}
if (rmApp != null && rmApp.getAMResourceRequests() != null) {
{code}
Maybe it could be just
{code:java}
if (rmApp.getAMResourceRequests() != null) {
{code}
since rmApp should be non-null at this point.
 What do you prefer?

 

2. It is true that 
{{Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare)}} would 
always return false when the {{queueMaxShare}} is 0 for any resource, but the 
problem with just using {{Resources.fitsIn}} is that it would also return 
false in cases where the requested resource is larger than a max resource that 
is not zero, e.g. requested vCores = 2, max vCores = 1.
 With this check, I only wanted to catch the cases where there is a resource 
request of some resource type but the queue has 0 of that resource in 
{{queueMaxShare}}.
 In this sense, this check alone would be enough in the if condition:
{code:java}
Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR, queueMaxShare)
{code}
but it is not quite right on its own, since it does not verify whether the 
resource is actually requested. For example, if an application does not 
request any vCores (maybe this cannot happen in reality) and we have 0 vCores 
as the maximum, then it is a perfectly reasonable request, so we don't need to 
reject the application. On the other hand, if an app requests 1 vCore and we 
have 0 vCores as the maximum, then rejection should happen.
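
To spell out the semantics I am after (a sketch for the two major resource 
types only, ignoring custom resource types for brevity; {{queueMaxShare}} and 
{{amResourceRequest}} as in the patch):
{code:java}
// Sketch: reject only when some resource is actually requested (> 0)
// while the queue's max share has 0 of that very resource.
Resource ask = amResourceRequest.getCapability();
boolean zeroMaxButRequested =
    (ask.getMemorySize() > 0 && queueMaxShare.getMemorySize() == 0)
        || (ask.getVirtualCores() > 0 && queueMaxShare.getVirtualCores() == 0);
if (zeroMaxButRequested) {
  // reject the application
}
{code}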
 Does this explanation, together with the sketch above, make it clearer?
 Do you think some comments should be added to the code above the if condition?
 How would you update the diagnostic message?

 

3. The overall intention of my changes in {{FairScheduler}} was the following: 
 Essentially, in {{addApplication()}}, the AM resource requests are checked 
against the queue's max resources.
 In {{allocate()}}, I check whether any container allocation request (e.g. for 
a map/reduce container) happens against a queue that has 0 of some resource 
configured as its max resource.
 So in my understanding, it can happen that in {{addApplication()}} the app 
was not rejected (for example, the AM does not request vCores and we have 0 
vCores configured as max resources), but then 1 vCore is requested for a map 
container.
 Please tell me whether this is clear.

 

4. 
 {{testAppRejectedToQueueZeroCapacityOfResource()}}: tests that an application 
is rejected when the AM resource request exceeds the queue's maximum resources 
(tests the code added to {{FairScheduler.addApplication}}).

{{testSchedulingRejectedToQueueZeroCapacityOfResource()}}: tests that an 
application is rejected when a map/reduce container request exceeds the 
queue's maximum resources (tests the code added to {{FairScheduler.allocate}}).
 Please check my comment for 3., where I explained a case in which an 
application is not rejected immediately upon submission but only when a 
map/reduce container request happens.

About the uncovered unit test: good point. I was thinking about whether we 
should reject an application only if the AM request is greater than 0 and we 
have 0 configured as the max resource, or simply in any case where the 
requested resource is greater than the max resource, regardless of whether it 
is 0 or not.

If the latter, then I agree: the unit tests and the if conditions in the 
production code need to be changed accordingly (using just 
{{Resources.fitsIn}} will work, I guess).
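
In that case the condition would collapse to something like this (a sketch 
only; {{rejectApplication}} is a hypothetical stand-in for whatever rejection 
path we end up using):
{code:java}
// Sketch: reject whenever the AM ask does not fit the queue's max share,
// regardless of whether the max share is zero or not.
if (!Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare)) {
  String msg = "AM resource request " + amResourceRequest.getCapability()
      + " exceeds the max share " + queueMaxShare + " of queue " + queueName;
  rejectApplication(applicationId, msg); // hypothetical reject helper
}
{code}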

I'm fine with either way; as you have more experience with FairScheduler, 
please advise which way I should go.

5.
 - Removed the unused import.
 - Renamed the methods as you suggested.
 - Thanks for the log message suggestions; you were right about those, the 
logs are much more understandable that way.

 

Thanks!


[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

2018-05-11 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472598#comment-16472598
 ] 

Haibo Chen commented on YARN-8248:
--

Thanks [~snemeth] for updating the patch. I have a few more comments/questions:

1) The AMResourceRequests of an application are already verified in 
RMAppManager.validateAndCreateResourceRequest(). We don't need to check 
whether they are null inside the fair scheduler anymore, IMO. Effectively, 
what I am suggesting is removing
{code:java}
  if (rmApp == null || rmApp.getAMResourceRequests() == null) {
    LOG.debug("rmApp or rmApp.AMResourceRequests was null!");
  }
{code}
2) Isn't Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR, 
queueMaxShare) already implied by 
!Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare)? That is, 
if the queue max resource is 0, 
Resources.fitsIn(amResourceRequest.getCapability(), queueMaxShare) would 
always return false. We'd also need to update the diagnostic message 
accordingly.
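
For example (a toy illustration of the subsumption, not code from the patch):
{code:java}
// The queue's max share has 0 vcores; any ask with vcores > 0 cannot fit.
Resource ask = Resource.newInstance(1024, 1); // 1024 MB, 1 vcore
Resource max = Resource.newInstance(2048, 0); // 2048 MB, 0 vcores
boolean fits = Resources.fitsIn(ask, max);    // false: 1 vcore > 0 vcores
{code}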

3) We don't need to check again in FairScheduler.allocate(), because it is 
always called after the app is accepted, which implies the check has already 
passed.

4) It is not clear to me how testAppRejectedToQueueZeroCapacityOfResource() 
differs from testSchedulingRejectedToQueueZeroCapacityOfResource(). The former 
case includes the latter, doesn't it? If so, I'd propose we get rid of 
testSchedulingRejectedToQueueZeroCapacityOfResource() and the associated 
tests. There is one other case not covered by unit tests: what if the max 
resource of a queue is not zero, but the AM resource request is larger than 
that max resource?

5) Some minor issues:

There is an unused import in FairScheduler.java;

Let's rename processEvents() -> addApplication(), and 
processAttempAddedEvent() -> addAppAttempt().

Some debug messages tend to describe what the code does; interpreting the 
debug log without the code at hand can be hard. A few suggestions:

LOG.debug("Assignment of container on node " + node + " is zero!"); -> 
LOG.debug("No container is allocated on node " + node);

"Resource ask %s fits in available node resources %s, but the allocated 
container was null!" -> "Resource ask %s fits in available node resources %s, 
but no container was allocated"

LOG.debug("Assign container precheck was false on node: " + node); -> 
LOG.debug("Assign container precheck on node " + node + " failed");

 






[jira] [Commented] (YARN-8248) Job hangs when a queue is specified and the maxResources of the queue cannot satisfy the AM resource request

2018-05-11 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472587#comment-16472587
 ] 

genericqa commented on YARN-8248:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
39s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 39s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 31s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 246 unchanged - 0 fixed = 247 total (was 246) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 35s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 70m 
57s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
22s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}124m 55s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8248 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12923072/YARN-8248-006.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8d5e00debded 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 
11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c1d64d6 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/20704/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/20704/testReport/ |
| Max. process+thread count | 874 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: