[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread Chengbing Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated YARN-10448:
-
Description: 
When using the synthetic generator json file example from the doc ( 
https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
 ), it throws the following exception:

{noformat}
java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

So the solution is either:
1) to make {{user_name}} a mandatory field, or
2) to set default user in SLS code if the json file does not define it.

IMO, solution 2 might be better, because in most cases (if not all) 
{{user_name}} has no impact on scheduler performance, thus it is reasonable to 
make it an optional field, which is also consistent with the {{job.user}} field 
in SLS JSON file.


  was:
java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)


> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set default user in SLS code if the json file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203783#comment-17203783
 ] 

Peter Bacsko commented on YARN-10447:
-

[~bteke] / [~snemeth] / [~adam.antal] could you pls review & commit?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203789#comment-17203789
 ] 

zhuqi commented on YARN-10448:
--

CC [~cheersyang] [~Tao Yang]   

Attached the 002 patch: set default user in \{{startAMFromSynthGenerator()}} if 
\{{job.getUser()}} is null.

Thanks.

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch
>
>
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203817#comment-17203817
 ] 

Hadoop QA commented on YARN-10448:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
48s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
27s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
30s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
25s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 36s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
55s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
52s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
32s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
23s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
21s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
25s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  9s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green}{color} | {color:green} the patch passed with JDK 

[jira] [Comment Edited] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal edited comment on YARN-10447 at 9/29/20, 2:22 PM:
-

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{this.activitiesManagerEnabled = enabled;}} there, is that 
right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?


was (Author: adam.antal):
Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{this.activitiesManagerEnabled = enabled; }} there, is that 
right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal edited comment on YARN-10447 at 9/29/20, 2:21 PM:
-

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{  this.activitiesManagerEnabled = enabled; }} there, is 
that right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?


was (Author: adam.antal):
Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{  this.activitiesManagerEnabled = enabled;}} there, is that 
right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal edited comment on YARN-10447 at 9/29/20, 2:21 PM:
-

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{this.activitiesManagerEnabled = enabled; }} there, is that 
right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?


was (Author: adam.antal):
Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{  this.activitiesManagerEnabled = enabled; }} there, is 
that right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203975#comment-17203975
 ] 

Adam Antal commented on YARN-10447:
---

Thanks [~pbacsko] for the patch.

I think this patch does not work, because in {{CapacityScheduler.java}} you 
have the following piece of code:
{code:java}
public void setActivitiesManagerEnabled(boolean enabled) {
  this.activitiesManagerEnabled = true;
}
{code}
I think you meant {{  this.activitiesManagerEnabled = enabled;}} there, is that 
right?

On a sidenote, I can see a {{activitiesManager.start();}} guarded by the flag, 
but there's no {{stop()}} call. Isn't that can cause memory leak?

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-29 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204035#comment-17204035
 ] 

Benjamin Teke commented on YARN-8737:
-

Hi [~Tao Yang],

Was there any particular reason this patch wasn't merged? Can we help in any 
way?

Thanks!

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> Administrator raised a update for queues through REST API, in RM parent queue 
> is refreshing child queues through calling ParentQueue#reinitialize, 
> meanwhile, async-schedule threads is sorting child queues when calling 
> ParentQueue#sortAndGetChildrenAllocationIterator. Race condition may happen 
> and throw exception as follow because TimSort does not handle the concurrent 
> modification of objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add read-lock for 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, the 
> write-lock will be hold when updating child queues in 
> ParentQueue#reinitialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10450:
---
Attachment: YARN-10450.001.patch

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: NodesPage.png, YARN-10450.001.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This is information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10413) Change fs2cs to generate mapping rules in the new format

2020-09-29 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10413:
--
Fix Version/s: 3.4.0

> Change fs2cs to generate mapping rules in the new format
> 
>
> Key: YARN-10413
> URL: https://issues.apache.org/jira/browse/YARN-10413
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10413-001.patch, YARN-10413-002.patch, 
> YARN-10413-003.patch, YARN-10413-004.patch, YARN-10413-005.patch
>
>
> Currently, the converter tool {{fs2cs}} can convert placement rules to 
> mapping rules, but the differences are too big.
> It should be modified to generate placement rules to the new engine and 
> output it to a separate JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204266#comment-17204266
 ] 

Jim Brennan commented on YARN-10450:


Attached an image showing the added UI elements (this is for the legacy UI).


> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: NodesPage.png
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This is information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10450:
---
Attachment: NodesPage.png

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: NodesPage.png
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This is information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10413) Change fs2cs to generate mapping rules in the new format

2020-09-29 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204187#comment-17204187
 ] 

Szilard Nemeth commented on YARN-10413:
---

Thanks [~pbacsko],

Latest patch LGTM, committed to trunk.

Resolving jira.

> Change fs2cs to generate mapping rules in the new format
> 
>
> Key: YARN-10413
> URL: https://issues.apache.org/jira/browse/YARN-10413
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10413-001.patch, YARN-10413-002.patch, 
> YARN-10413-003.patch, YARN-10413-004.patch, YARN-10413-005.patch
>
>
> Currently, the converter tool {{fs2cs}} can convert placement rules to 
> mapping rules, but the differences are too big.
> It should be modified to generate placement rules to the new engine and 
> output it to a separate JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-29 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10393:
--

Assignee: Jim Brennan  (was: zhenzhao wang)

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater go an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request is actually handled by resource 
> manager, however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-29 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204272#comment-17204272
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~wzzdreamer]!  I will put up a full patch soon.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job stuck in a 
> live lock state and couldn't make any progress. The queue is full so the 
> pending mapper can’t get any resource to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater go an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request is actually handled by resource 
> manager, however, the node manager failed to receive the response. Let’s 
> assume the heartBeatResponseId=$hid in node manager. According to our current 
> configuration, next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at 

[jira] [Updated] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10447:

Attachment: YARN-10447-003.patch

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch, 
> YARN-10447-003.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204161#comment-17204161
 ] 

Peter Bacsko commented on YARN-10447:
-

[~adam.antal] thanks, I fixed these mistakes.

> TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
> -
>
> Key: YARN-10447
> URL: https://issues.apache.org/jira/browse/YARN-10447
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10447-001.patch, YARN-10447-002.patch, 
> YARN-10447-003.patch
>
>
> YARN-9784 fixed some concurrency related issues in {{TestLeafQueue}}, but not 
> all of them. Occasionally it's still possible to receive an exception from 
> Mockito and the two following stack traces can be observed in the console:
> {noformat}
>   
>   org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
> or
> {noformat}
> 2020-09-22 14:44:52,584 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.3 avail= vCores:1>
> 2020-09-22 14:44:52,585 INFO  [main] capacity.TestUtils 
> (TestUtils.java:getMockNode(227)) - node = 127.0.0.4 avail= vCores:1>
> Exception in thread "ActivitiesManager thread." java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.lang.Boolean
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$$EnhancerByMockitoWithCGLIB$$272c72c5.isMultiNodePlacementEnabled()
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded(ActivitiesManager.java:266)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager.access$500(ActivitiesManager.java:63)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager$1.run(ActivitiesManager.java:347)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It's probably best to disable ActivitiesManager thread entirely in this test 
> class, there is no need for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10450:
--

 Summary: Add cpu and memory utilization per node and cluster-wide 
metrics
 Key: YARN-10450
 URL: https://issues.apache.org/jira/browse/YARN-10450
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.3.1
Reporter: Jim Brennan
Assignee: Jim Brennan


Add metrics to show actual cpu and memory utilization for each node and 
aggregated for the entire cluster.  This is information is already passed from 
NM to RM in the node status update.
We have been running with this internally for quite a while and found it useful 
to be able to quickly see the actual cpu/memory utilization on the 
node/cluster.  It's especially useful if some form of overcommit is used.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10447) TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing

2020-09-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204273#comment-17204273
 ] 

Hadoop QA commented on YARN-10447:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
38s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
41s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 22s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
55s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
53s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  7s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green}{color} | 

[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204357#comment-17204357
 ] 

Hadoop QA commented on YARN-10450:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
26s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 3 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 
32s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
11s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m  2s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
24s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
23s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
11s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
11s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 39s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/202/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 7 new + 76 unchanged - 0 fixed = 83 total (was 76) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/202/artifact/out/whitespace-eol.txt{color}
 | {color:red} The patch has 3 line(s) that end in whitespace. Use git apply 
--whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-29 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204370#comment-17204370
 ] 

Hadoop QA commented on YARN-10393:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 4s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
22s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  8s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
30s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
28s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 29s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/203/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 1 new + 120 unchanged - 0 fixed = 121 total (was 120) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 52s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |

[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-09-29 Thread Sunil G (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204463#comment-17204463
 ] 

Sunil G commented on YARN-10244:


Thanks [~Steven Rand]

Lets get this in. 

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-09-29 Thread Tao Yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204467#comment-17204467
 ] 

Tao Yang commented on YARN-8737:


Hi, [~Amithsha], [~wangda], [~bteke]. Sorry for missing this issue so long.
I haven't dug into this issue or checked if the exception never happen again (I 
have just search the key words "Comparison method violates its general 
contract" from RM logs of our YARN clusters which can only be stored for 7 
days, nothing returned) since this exception can't crash or affect the 
scheduling process in our internal versions. 
After looking into YARN-10178, I think this problem may be raised by multiple 
causes, the same point is that some resources like capacity-resource or 
used-resource in child queues(leaf or parent queue) changed while parent queue 
is sorting them. 
I think this patch can solve the problem for the configurations-updating 
scenario, adding read lock in ParentQueue#sortAndGetChildrenAllocationIterator 
can avoid the child queues' configured capacity be updated while being sorted. 
[~wangda], [~bteke] very appreciate if you can help to review and commit this 
patch.
And we should also fix the problem for the scheduling scenario in YARN-10178.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> Administrator raised a update for queues through REST API, in RM parent queue 
> is refreshing child queues through calling ParentQueue#reinitialize, 
> meanwhile, async-schedule threads is sorting child queues when calling 
> ParentQueue#sortAndGetChildrenAllocationIterator. Race condition may happen 
> and throw exception as follow because TimSort does not handle the concurrent 
> modification of objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can add read-lock for 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem, the 
> write-lock will be hold when updating child queues in 
> ParentQueue#reinitialize.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9693) When AMRMProxyService is enabled RMCommunicator will register with failure

2020-09-29 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203663#comment-17203663
 ] 

zhuqi commented on YARN-9693:
-

CC [~cane]  [~panlijie] 

If the patch fix the problem, i meet the same error too.

Thanks.

> When AMRMProxyService is enabled RMCommunicator will register with failure
> --
>
> Key: YARN-9693
> URL: https://issues.apache.org/jira/browse/YARN-9693
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9693.001.patch
>
>
> When we enable amrm proxy service, the  RMCommunicator will register with 
> failure below:
> {code:java}
> 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] 
> org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler 
> thread interrupted
> 2019-07-23 17:12:44,794 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563872237585_0001_02
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: 
> Invalid AMRMToken from appattempt_1563872237585_0001_02
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170)
>   ... 14 more
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken 

[jira] [Updated] (YARN-7494) Add multi-node lookup mechanism and pluggable nodes sorting policies to optimize placement decision

2020-09-29 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-7494:
-
Summary: Add multi-node lookup mechanism and pluggable nodes sorting 
policies to optimize placement decision  (was: Add muti-node lookup mechanism 
and pluggable nodes sorting policies to optimize placement decision)

> Add multi-node lookup mechanism and pluggable nodes sorting policies to 
> optimize placement decision
> ---
>
> Key: YARN-7494
> URL: https://issues.apache.org/jira/browse/YARN-7494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7494.001.patch, YARN-7494.002.patch, 
> YARN-7494.003.patch, YARN-7494.004.patch, YARN-7494.005.patch, 
> YARN-7494.006.patch, YARN-7494.007.patch, YARN-7494.008.patch, 
> YARN-7494.009.patch, YARN-7494.010.patch, YARN-7494.11.patch, 
> YARN-7494.12.patch, YARN-7494.13.patch, YARN-7494.14.patch, 
> YARN-7494.15.patch, YARN-7494.16.patch, YARN-7494.17.patch, 
> YARN-7494.18.patch, YARN-7494.19.patch, YARN-7494.20.patch, 
> YARN-7494.v0.patch, YARN-7494.v1.patch, multi-node-designProposal.png
>
>
> Instead of single node, for effectiveness we can consider a multi node lookup 
> based on partition to start with.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10448:
-
Summary: SLS should set default user to handle SYNTH format  (was: When use 
the sls (SYNTH JSON input file format) example the user be null will cause 
failed.)

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch
>
>
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org