[jira] [Commented] (YARN-11018) RM rest api show error resources in capacity scheduler with nodelabels

2021-12-09 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456719#comment-17456719
 ] 

Eric Badger commented on YARN-11018:


I think [~epayne] is probably more qualified to review this given that he 
worked on YARN-10343

> RM rest api show error resources in capacity scheduler with nodelabels
> --
>
> Key: YARN-11018
> URL: https://issues.apache.org/jira/browse/YARN-11018
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Major
> Attachments: YARN-11018.001.patch
>
>
> Because resource metrics are updated only for the "default" partition, 
> allocatedMB, allocatedVCores, totalMB, and totalVirtualCores are wrong in the 
> capacity scheduler when node labels are used. 
> When we fetch cluster metrics with 'curl 
> http://rm:8088/ws/v1/cluster/metrics', we get incorrect totalMB and 
> totalVirtualCores.
> It should use resources across all partitions instead.
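
A quick way to see the discrepancy is to compare the cluster-wide metrics with 
the per-partition scheduler info. A minimal sketch (hostname and port are 
illustrative, taken from the description above):

{noformat}
# totalMB/totalVirtualCores here only reflect the "default" partition
curl http://rm:8088/ws/v1/cluster/metrics

# the scheduler endpoint reports capacities per partition/node label
curl http://rm:8088/ws/v1/cluster/scheduler
{noformat}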






[jira] [Commented] (YARN-9818) test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of container-executor.cfg

2021-10-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427319#comment-17427319
 ] 

Eric Badger commented on YARN-9818:
---

I believe when you do a native build there is a file created called {{cetest}}. 
You just need to execute that binary. There is also a 
{{test-container-executor}} binary, but that is a different piece of code (it 
tests other parts of the container-executor, for... reasons?)
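
For reference, a rough sketch of building and running it (module path and 
binary location are assumed from the usual Hadoop source layout):

{noformat}
# Build the nodemanager native bits, including the C/C++ tests
mvn clean install -Pnative -DskipTests \
    -pl hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager

# Run the gtest binary produced by the native build
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/test/cetest
{noformat}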

> test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of 
> container-executor.cfg
> ---
>
> Key: YARN-9818
> URL: https://issues.apache.org/jira/browse/YARN-9818
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9818.001.patch
>
>
> The code attempts to mount a directory that is a parent of 
> container-executor.cfg. However, the docker.allowed.[ro,rw]-mounts settings 
> in the container-executor.cfg don't allow that directory. So the test isn't 
> ever getting to the code where we disallow the mount because it is a parent 
> of container-executor.cfg. The test is disallowing it because the mount isn't 
> in the allowed mounts list. 






[jira] [Commented] (YARN-9818) test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of container-executor.cfg

2021-10-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427309#comment-17427309
 ] 

Eric Badger commented on YARN-9818:
---

This is from a few years ago so I don't quite remember the details, but from 
what I remember, the test was passing back then. The problem was that the test 
wasn't testing what it said it was. This patch was to fix the test so that it 
would accurately test what it was looking to test. But the patch was never 
reviewed/committed

> test_docker_util.cc:test_add_mounts doesn't correctly test for parent dir of 
> container-executor.cfg
> ---
>
> Key: YARN-9818
> URL: https://issues.apache.org/jira/browse/YARN-9818
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9818.001.patch
>
>
> The code attempts to mount a directory that is a parent of 
> container-executor.cfg. However, the docker.allowed.[ro,rw]-mounts settings 
> in the container-executor.cfg don't allow that directory. So the test isn't 
> ever getting to the code where we disallow the mount because it is a parent 
> of container-executor.cfg. The test is disallowing it because the mount isn't 
> in the allowed mounts list. 






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419340#comment-17419340
 ] 

Eric Badger commented on YARN-10935:


Also thanks to [~ahussein] for the additional review!

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4, 3.1.5
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue whose capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.
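
For reference, a minimal capacity-scheduler.xml sketch of the queue setup 
described above (queue names and percentages are illustrative):

{noformat}
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.capacity</name>
  <value>50</value>
</property>
<property>
  <!-- capacity == maximum-capacity, per the description -->
  <name>yarn.scheduler.capacity.root.parent.maximum-capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.a.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.a.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.b.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.b.maximum-capacity</name>
  <value>100</value>
</property>
{noformat}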






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-23 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10935:
---
Fix Version/s: 3.1.5
   3.2.4
   2.10.2

Thanks for the additional patches, [~epayne]! +1 on them and I've committed 
them. The patches have now been committed to trunk (3.4), branch-3.3, 
branch-3.2, branch-3.1 (apparently unnecessary, but I did it anyway. Oops), and 
branch-2.10

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4, 3.1.5
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue whose capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10935:
---
Fix Version/s: 3.3.2
   3.4.0

[~epayne], looks like it applies cleanly back to branch-3.3. So I committed it 
to trunk (3.4) and branch-3.3. But I'll need patches for branch-3.2 and 
earlier if you'd like it backported further

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue whose capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-14 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415243#comment-17415243
 ] 

Eric Badger commented on YARN-10935:


[~epayne], +1 the patch looks good to me. However, trunk compilation is 
currently failing for me due to HADOOP-17891 and I'd like to get that cleared 
up before committing your patch (I don't like committing things when I can't 
compile)

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue whose capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Commented] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-22 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385635#comment-17385635
 ] 

Eric Badger commented on YARN-10860:


Thanks, [~zhuqi]!

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2
>
> Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.
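
For context, these Capacity Scheduler settings live in capacity-scheduler.xml; 
once refreshable, they can be changed without an RM restart. A sketch (values 
illustrative):

{noformat}
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>100</value>
</property>
{noformat}

followed by:

{noformat}
yarn rmadmin -refreshQueues
{noformat}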






[jira] [Commented] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-21 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385024#comment-17385024
 ] 

Eric Badger commented on YARN-10860:


[~zhuqi], thanks for the review and commit! And thanks [~gandras] for the 
additional review. Just a reminder that when committing patches you should 
attempt to cherry-pick them as far back as you can unless they are risky and/or 
unstable. In this case you committed patches to trunk (3.4) and branch-2.10. 
However, the trunk patch should also be cherry-picked back to the other active 
3.x branches, which are branch-3.3 and branch-3.2. Could you please cherry-pick 
the trunk patch to those 2 branches? Thanks!

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0, 2.10.2
>
> Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.






[jira] [Commented] (YARN-10867) YARN should expose a ENV used to map a custom device into docker container

2021-07-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384513#comment-17384513
 ] 

Eric Badger commented on YARN-10867:


https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainers.html

I believe you can just use {{docker.allowed.devices}} in your 
container-executor.cfg file if you need to mount an actual device. However, 
the container will need to be privileged to do that, so you'll also need to 
set {{docker.privileged-containers.enabled=true}}. Note that running 
privileged containers is very risky and brings a lot of security concerns 
with it, so proceed with caution. 

After setting those, I believe you can use 
{{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} to specify the mounts that you want, 
including the device such as {{/dev/fuse}}
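
Putting those pieces together, a hedged configuration sketch (only the two 
keys named above are taken from the docs; everything else about the deployment 
is assumed):

{noformat}
# container-executor.cfg
[docker]
  docker.allowed.devices=/dev/fuse
  docker.privileged-containers.enabled=true
{noformat}

and at submission time:

{noformat}
export YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/dev/fuse:/dev/fuse:rw"
{noformat}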

> YARN should expose a ENV used to map a custom device into docker container
> --
>
> Key: YARN-10867
> URL: https://issues.apache.org/jira/browse/YARN-10867
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chi Heng
>Priority: Major
>
> In some scenarios, such as mounting a FUSE filesystem in Docker, a user needs 
> to map a custom device (e.g. /dev/fuse) into the Docker container. I notice 
> that an addDevice method is defined in [ 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/docker/DockerRunCommand.java
>  
> |https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/docker/DockerRunCommand.java]
>  , so I suppose an ENV or config property should be exposed to the user to 
> invoke this method.






[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10860:
---
Attachment: (was: YARN-10860.001.patch)

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.






[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10860:
---
Attachment: YARN-10860.001.patch

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.






[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10860:
---
Attachment: YARN-10860.branch-2.10.001.patch

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10860.001.patch, YARN-10860.branch-2.10.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.






[jira] [Updated] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10860:
---
Attachment: YARN-10860.001.patch

> Make max container per heartbeat configs refreshable
> 
>
> Key: YARN-10860
> URL: https://issues.apache.org/jira/browse/YARN-10860
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10860.001.patch
>
>
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> and 
> {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
> are currently *not* refreshable configs, but I believe they should be. This 
> JIRA is to turn these into refreshable configs, just like 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} 
> is.






[jira] [Created] (YARN-10860) Make max container per heartbeat configs refreshable

2021-07-19 Thread Eric Badger (Jira)
Eric Badger created YARN-10860:
--

 Summary: Make max container per heartbeat configs refreshable
 Key: YARN-10860
 URL: https://issues.apache.org/jira/browse/YARN-10860
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Badger
Assignee: Eric Badger


{{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
and {{yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled}} 
are currently *not* refreshable configs, but I believe they should be. This 
JIRA is to turn these into refreshable configs, just like 
{{yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments}} is.






[jira] [Commented] (YARN-10761) Add more event type to RM Dispatcher event metrics.

2021-05-06 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340347#comment-17340347
 ] 

Eric Badger commented on YARN-10761:


Thanks for the patch, [~zhuqi]. 

Is there a reason we need to call {{create()}} twice for each metric? The code 
in the patch calls it once to create the {{GenericEventTypeMetricsManager}} 
and then again just so that it can call {{getEnumClass()}}. It seems better to 
save the first {{create()}} call off into a local variable and then call 
{{getEnumClass()}} on that so we don't have to call {{create()}} twice per 
metric
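
A sketch of the suggested refactor (names and signatures assumed from the 
quoted patch):

{noformat}
// Save the result of the single create() call...
GenericEventTypeMetrics metrics = GenericEventTypeMetricsManager
    .create(dispatcher.getName(), NodesListManagerEventType.class);
// ...and reuse it instead of calling create() a second time
dispatcher.addMetrics(metrics, metrics.getEnumClass());
{noformat}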

> Add more event type to RM Dispatcher event metrics.
> ---
>
> Key: YARN-10761
> URL: https://issues.apache.org/jira/browse/YARN-10761
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10761.001.patch, image-2021-05-06-16-38-51-406.png, 
> image-2021-05-06-16-39-28-362.png
>
>
> YARN-9615 added NodesListManagerEventType to the event metrics.
> We should also add all 4 busy event types to the metrics, according to 
> YARN-9927.






[jira] [Commented] (YARN-10745) Change Log level from info to debug for few logs and remove unnecessary debuglog checks

2021-05-05 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339816#comment-17339816
 ] 

Eric Badger commented on YARN-10745:


Hi [~dmmkr], thanks for the patch. Overall I think it has changes that make 
sense, but I have a few comments/questions

{noformat}
-  if (LOG.isDebugEnabled()) {
-LOG.debug("Auth is SASL user=\"{}\" JAAS context=\"{}\"",
-jaasClientIdentity, jaasClientEntry);
-  }
+LOG.debug("Auth is SASL user=\"{}\" JAAS context=\"{}\"",
+  jaasClientIdentity, jaasClientEntry);
{noformat}
Looks like the wrong indentation here

{noformat}
   switch (purgePolicy) {
 case SkipOnChildren:
   // don't do the deletion... continue to next record
-  if (LOG.isDebugEnabled()) {
-LOG.debug("Skipping deletion");
-  }
+LOG.debug("Skipping deletion");
   toDelete = false;
   break;
 case PurgeAll:
   // mark for deletion
-  if (LOG.isDebugEnabled()) {
-LOG.debug("Scheduling for deletion with children");
-  }
+LOG.debug("Scheduling for deletion with children");
   toDelete = true;
   entries = new ArrayList<>(0);
   break;
 case FailOnChildren:
-  if (LOG.isDebugEnabled()) {
-LOG.debug("Failing deletion operation");
-  }
+LOG.debug("Failing deletion operation");
   throw new PathIsNotEmptyDirectoryException(path);
{noformat}
Same here with the case statements

{noformat}
 List<NodeReport> clusterNodeReports = yarnClient.getNodeReports(
 NodeState.RUNNING);
-LOG.info("Got Cluster node info from ASM");
+if (clusterNodeReports.isEmpty()) {
+  LOG.info("Got Empty Cluster node Report info from ASM");
+}
{noformat}
Is {{clusterNodeReports}} guaranteed to be non-null here? Otherwise we can NPE
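
A defensive variant might look like this (a sketch; the surrounding client 
code is assumed):

{noformat}
List<NodeReport> clusterNodeReports =
    yarnClient.getNodeReports(NodeState.RUNNING);
if (clusterNodeReports == null || clusterNodeReports.isEmpty()) {
  LOG.info("Got empty cluster node report info from ASM");
}
{noformat}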

{noformat}
-// NodeManager is the last service to start, so NodeId is available.
+// NodeStatusUpdater is the last service to start, so NodeId is available.
{noformat}
I'm not sure what this change is for. The comment seems to imply that 
NodeStatusUpdater is the last service to start, so the service that populates 
NodeId will already be done. NodeManager is probably the last service to start 
overall since it adds all of the other services, but I don't think the change 
in the comment makes the code any clearer

{noformat}
+  LOG.info("Callback succeeded for initializing request processing " +
+  "pipeline for an AM ");
{noformat}
Can you comment on this log statement? Have you found it useful for 
debugging? Does it only get logged rarely?

{noformat}
-LOG.info("hostsReader include:{" +
-StringUtils.join(",", hostsReader.getHosts()) +
-"} exclude:{" +
-StringUtils.join(",", hostsReader.getExcludedHosts()) + "}");
-
+if (!hostsReader.getHosts().isEmpty() ||
+!hostsReader.getExcludedHosts().isEmpty()) {
+  LOG.info("hostsReader include:{" +
+  StringUtils.join(",", hostsReader.getHosts()) +
+  "} exclude:{" +
+  StringUtils.join(",", hostsReader.getExcludedHosts()) + "}");
+}
{noformat}
I feel like we're losing information here. Knowing that the hostsReader is 
empty is helpful. We could log it differently, but I don't think we want to 
lose that information
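
One way to keep the signal without losing the empty case (sketch):

{noformat}
if (hostsReader.getHosts().isEmpty() &&
    hostsReader.getExcludedHosts().isEmpty()) {
  LOG.info("hostsReader include and exclude lists are both empty");
} else {
  LOG.info("hostsReader include:{" +
      StringUtils.join(",", hostsReader.getHosts()) +
      "} exclude:{" +
      StringUtils.join(",", hostsReader.getExcludedHosts()) + "}");
}
{noformat}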

> Change Log level from info to debug for few logs and remove unnecessary 
> debuglog checks
> ---
>
> Key: YARN-10745
> URL: https://issues.apache.org/jira/browse/YARN-10745
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-10745.001.patch, YARN-10745.002.patch, 
> YARN-10745.003.patch, YARN-10745.004.patch
>
>
> Change the log level from info to debug for a few logs so that the load on 
> the logger decreases in large clusters and performance improves.
> Remove the unnecessary isDebugEnabled() checks around strings printed 
> without any string concatenation






[jira] [Commented] (YARN-10648) NM local logs are not cleared after uploading to hdfs

2021-05-04 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339350#comment-17339350
 ] 

Eric Badger commented on YARN-10648:


The patch looks good, but I'll wait for [~grepas], [~rkanter], and [~snemeth] 
to look at this as they were the ones that worked on the original code that 
created the issue. 

> NM local logs are not cleared after uploading to hdfs
> -
>
> Key: YARN-10648
> URL: https://issues.apache.org/jira/browse/YARN-10648
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.2.0
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10648.001.patch
>
>
> YARN-8273 has introduced the following issues.
>  # The delService.delete(deletionTask) call has been removed from the for 
> loop and added at the end in a finally block. Inside the for loop we create 
> a FileDeletionTask for each container but do not store it; because of this, 
> only the last container's log files will be present in the deletionTask and 
> only those files will be removed. Ideally, all of the container log files 
> which were uploaded must be deleted.
>  # The LogAggregationDFSException is caught in closeWriter, but when we 
> configure LogAggregationTFileController as the logAggregationFileController, 
> this.logAggregationFileController.closeWriter() itself calls closeWriter, 
> which throws a LogAggregationDFSException if any, and the exception is not 
> saved. When we then call closeWriter again we get no exception, and the 
> LogAggregationDFSException is never thrown in this scenario.
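
A minimal sketch of the fix the first point implies: store every 
per-container deletion task and run them all at the end (the helper names 
here are hypothetical):

{noformat}
List<FileDeletionTask> pendingDeletions = new ArrayList<>();
try {
  for (Container container : finishedContainers) {
    uploadContainerLogs(container);                      // hypothetical helper
    pendingDeletions.add(buildDeletionTask(container));  // keep every task
  }
} finally {
  for (FileDeletionTask task : pendingDeletions) {
    delService.delete(task);  // previously only the last task was deleted
  }
}
{noformat}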






[jira] [Commented] (YARN-9927) RM multi-thread event processing mechanism

2021-04-29 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335804#comment-17335804
 ] 

Eric Badger commented on YARN-9927:
---

{noformat}
+// Test multi thread dispatcher
+conf.setBoolean(YarnConfiguration.
+MULTI_THREAD_DISPATCHER_ENABLED, true);
{noformat}
If this is a feature that is disabled by default, I don't think we should have 
it enabled by default in all of the RM tests. I would be happier running it as 
a parameterized test with both multi and single thread dispatchers.
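
A hedged sketch of what that could look like (the test class name is 
hypothetical; the config constant is taken from the quoted patch):

{noformat}
import java.util.Arrays;
import java.util.Collection;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestRMDispatcherModes {

  @Parameterized.Parameters(name = "multiThreadDispatcher={0}")
  public static Collection<Object[]> modes() {
    return Arrays.asList(new Object[][] {{false}, {true}});
  }

  private final boolean multiThread;

  public TestRMDispatcherModes(boolean multiThread) {
    this.multiThread = multiThread;
  }

  @Test
  public void testEventHandling() throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean(YarnConfiguration.MULTI_THREAD_DISPATCHER_ENABLED,
        multiThread);
    // ...run the existing dispatcher assertions against this conf...
  }
}
{noformat}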

In general I think the patch looks reasonable, but I would like to see testing 
done to see if this makes the problem better or worse. I would think it would 
make things better, but until we run some real tests on it, we won't really 
know. So getting something similar to what [~hcarrot] provided originally would 
be good. That way we can merge this with confidence. 

> RM multi-thread event processing mechanism
> --
>
> Key: YARN-9927
> URL: https://issues.apache.org/jira/browse/YARN-9927
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0, 2.9.2
>Reporter: hcarrot
>Assignee: Qi Zhu
>Priority: Major
> Attachments: RM multi-thread event processing mechanism.pdf, 
> YARN-9927.001.patch, YARN-9927.002.patch, YARN-9927.003.patch, 
> YARN-9927.004.patch, YARN-9927.005.patch
>
>
> Recently, we have observed serious event blocking in the RM event dispatcher 
> queue. After analyzing RM event monitoring data and RM event processing 
> logic, we found that:
> 1) environment: a cluster with thousands of nodes
> 2) RMNodeStatusEvent dominates 90% of the time consumed by the RM event 
> scheduler
> 3) meanwhile, RM event processing is single-threaded, which results in low 
> headroom for the RM event scheduler and thus poor RM performance.
> So we propose an RM multi-thread event processing mechanism to improve RM 
> performance.






[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.

2021-04-29 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335726#comment-17335726
 ] 

Eric Badger commented on YARN-10707:


Thanks for the updates, [~zhuqi]! +1 I've committed this to trunk (3.4) and 
branch-3.3. There are conflicts backporting back further than that

> Support custom resources in ResourceUtilization, and update Node GPU 
> Utilization to use.
> 
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, 
> YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, 
> YARN-10707.009.patch, YARN-10707.010.patch, YARN-10707.011.patch
>
>
> Support GPUs in ResourceUtilization, and update the node GPU utilization to 
> use it first.
> This will be very helpful for other use cases involving GPU utilization.






[jira] [Updated] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.

2021-04-29 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10707:
---
Fix Version/s: 3.3.1
   3.4.0

> Support custom resources in ResourceUtilization, and update Node GPU 
> Utilization to use.
> 
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, 
> YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, 
> YARN-10707.009.patch, YARN-10707.010.patch, YARN-10707.011.patch
>
>
> Support GPUs in ResourceUtilization, and update the node GPU utilization to 
> use it first.
> This will be very helpful for other use cases involving GPU utilization.






[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-27 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333560#comment-17333560
 ] 

Eric Badger commented on YARN-10493:


bq. In theory we could change that if there is a benefit in your opinion, but 
my initial reaction is that adding sub-directories to that namespace may make 
it harder to track images (cleanup, governance, perhaps even quotas, etc.).
I don't think it's a huge deal. A nice to have feature, but if it requires a 
major rework then I don't think it's necessary. The reason I think it would be 
nice is so that we can more cleanly segment our images. E.g. you could have 
{{hadoop/small-image/rhel7:7.9}} or something like that. But again, it's not a 
huge deal if it's difficult

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf, 
> runc-container-repository-v2-design_updated.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.

2021-04-27 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333520#comment-17333520
 ] 

Eric Badger commented on YARN-10707:


Thanks for the update, [~zhuqi]! The content looks good, I just have a few nits 
on naming conventions.

{noformat}
+  public float getNodePhysGpus() throws Exception{
{noformat}
I think a better name for this method would be {{getTotalNodeGpuUtilization}} 
and {{getNodeGpuUtilization}} would be better off as 
{{getAvgNodeGpuUtilization}}. Then {{totalGpuUtilization}} would also be 
changed to {{avgGpuUtilization}} in {{getAvgNodeGpuUtilization}}. This way we 
have a clear distinction on what the methods are returning. Relevant Javadocs 
would also be nice for each of the methods. 

> Support custom resources in ResourceUtilization, and update Node GPU 
> Utilization to use.
> 
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, 
> YARN-10707.006.patch, YARN-10707.007.patch, YARN-10707.008.patch, 
> YARN-10707.009.patch
>
>
> Support GPUs in ResourceUtilization, and update the node GPU utilization to 
> use it first.
> This will be very helpful for other use cases involving GPU utilization.






[jira] [Commented] (YARN-7713) Add parallel copying of directories into FSDownload

2021-04-27 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333483#comment-17333483
 ] 

Eric Badger commented on YARN-7713:
---

Thanks for taking this up, [~ChrisKarampeazis]. I noticed that you weren't a 
contributor in JIRA yet so I've added you as one. You may now assign JIRAs to 
yourself in all of the Hadoop projects (YARN, Common, HDFS, Mapreduce).

In general I think the PR looks good, but I think it would be nice and not too 
awfully difficult to sort the list of files to be localized by file size and 
then split the list into chunks based on that. That way we don't end up with 1 
thread downloading 4 files of 2 KB and another thread downloading 4 files of 4 
GB.
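
A rough sketch of the size-aware chunking (FileStatus comes from 
org.apache.hadoop.fs; numThreads and sourceFiles are assumed inputs):

{noformat}
List<FileStatus> files = new ArrayList<>(sourceFiles);
files.sort(Comparator.comparingLong(FileStatus::getLen).reversed());

List<List<FileStatus>> chunks = new ArrayList<>();
for (int i = 0; i < numThreads; i++) {
  chunks.add(new ArrayList<>());
}
for (int i = 0; i < files.size(); i++) {
  // round-robin over the size-sorted list roughly balances bytes per thread
  chunks.get(i % numThreads).add(files.get(i));
}
{noformat}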

> Add parallel copying of directories into FSDownload
> ---
>
> Key: YARN-7713
> URL: https://issues.apache.org/jira/browse/YARN-7713
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Miklos Szegedi
>Assignee: Christos Karampeazis-Papadakis
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN currently copies directories sequentially when localizing. This could be 
> improved to do in parallel, since the source blocks are normally on different 
> nodes.






[jira] [Assigned] (YARN-7713) Add parallel copying of directories into FSDownload

2021-04-27 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reassigned YARN-7713:
-

Assignee: Christos Karampeazis-Papadakis

> Add parallel copying of directories into FSDownload
> ---
>
> Key: YARN-7713
> URL: https://issues.apache.org/jira/browse/YARN-7713
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Miklos Szegedi
>Assignee: Christos Karampeazis-Papadakis
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN currently copies directories sequentially when localizing. This could be 
> improved to do in parallel, since the source blocks are normally on different 
> nodes.






[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-27 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1773#comment-1773
 ] 

Eric Badger commented on YARN-10493:


What I'm saying on the split thing is that in the current state 
{{hadoop/rhel7/myimage:current}} would throw an exception. But I don't see why 
that is necessary. In the above case, why not have {{hadoop}} as the namespace 
and {{rhel7/myimage:current}} as the image name? 

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf, 
> runc-container-repository-v2-design_updated.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.

2021-04-26 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332739#comment-17332739
 ] 

Eric Badger commented on YARN-10707:


Thanks for the updated patch, [~zhuqi]! It's much cleaner and much smaller now

{noformat}
 float nodeGpuUtilization = 0F;
+float nodeGpus = 0F;
 try {
   if (gpuNodeResourceUpdateHandler != null) {
 nodeGpuUtilization =
 gpuNodeResourceUpdateHandler.getNodeGpuUtilization();
+nodeGpus =
+gpuNodeResourceUpdateHandler.getNodePhysGpus();
   }
 } catch (Exception e) {
   LOG.error("Get Node GPU Utilization error: " + e);
 }
{noformat}
Ideally this wouldn't be GPU-specific and we could add all plugin utilizations 
to the nodeUtilization object. But that is beyond the scope of this JIRA, so I 
think this is fine. However, I think we can get a better name than 
{{nodeGpus}}. Maybe {{TotalNodeGpuUtilization}}?

Additionally, why are we sending the average GPU utilization to the NM metrics, 
but the total GPU utilization to the RM? Memory and CPU are consistent across 
the two. I don't understand why GPU is different.

> Support custom resources in ResourceUtilization, and update Node GPU 
> Utilization to use.
> 
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, 
> YARN-10707.006.patch, YARN-10707.007.patch
>
>
> Support GPUs in ResourceUtilization, and update the node GPU utilization to 
> use it first.
> This will be very helpful for other use cases involving GPU utilization.






[jira] [Updated] (YARN-10749) Can't remove all node labels after add node label without nodemanager port, broken by YARN-10647

2021-04-23 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10749:
---
Fix Version/s: 3.2.3
   2.10.2
   3.1.5
   3.3.1
   3.4.0

Thanks for the patch, [~dmmkr] and [~zhuqi] for the review. +1 committed to 
trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10

> Can't remove all node labels after add node label without nodemanager port, 
> broken by YARN-10647
> 
>
> Key: YARN-10749
> URL: https://issues.apache.org/jira/browse/YARN-10749
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10749.001.patch, YARN-10749.002.patch
>
>
> The fix done in YARN-10501, doesn't work after YARN-10647.
> To reproduce follow the same steps in YARN-10501






[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326150#comment-17326150
 ] 

Eric Badger commented on YARN-10493:


Thanks for the latest patch. I tested out the patch along with the CLI tool 
from YARN-10494 and everything seems to be working well. The addition of 
namespaces has fixed my issue from last time with the {{hadoop/rhel7:current}} 
image. But I do have a few comments. In addition to these comments, I would 
like to commit both this JIRA and YARN-10494 at the same time, because I don't 
particularly think either makes sense without the other. And YARN-10494 has a 
blocker on it because of the docker->overlayfs incompatibilities with 
whiteout/opaque files. Anyway, here are my comments on this JIRA

{noformat}
if (!tag.equals("")) {
{noformat}
nit: There are a few times where the patch uses this, but I think {{isEmpty()}} 
is more appropriate than {{equals("")}}. 

{noformat}
String[] nameParts = imageCoordinates.split("/", -1);
String imageTag;
if (nameParts.length == 2) {
  metaNamespace = nameParts[0];
  imageTag = nameParts[1];
} else if (nameParts.length == 1) {
  imageTag = nameParts[0];
} else {
  throw new IllegalArgumentException("Invalid image coordinates: "
  + imageCoordinates);
}
{noformat}
According to the 
[documentation|https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-]
 for {{split}}, this code will create a String array with the number of 
elements equal to the number of {{/}} + 1. But then we only look at the first 2 
parts of the array. Is there a reason not to take only the part before the 
first slash as the namespace and then take the rest as the imageTag?
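
A sketch of the alternative parsing (variable names taken from the quoted 
code; limit=2 keeps everything after the first slash in the image tag):

{noformat}
String[] nameParts = imageCoordinates.split("/", 2);
String imageTag;
if (nameParts.length == 2) {
  metaNamespace = nameParts[0];
  imageTag = nameParts[1];
} else {
  imageTag = nameParts[0];
}
{noformat}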

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf, 
> runc-container-repository-v2-design_updated.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Commented] (YARN-10707) Support custom resources in ResourceUtilization, and update Node GPU Utilization to use.

2021-04-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326097#comment-17326097
 ] 

Eric Badger commented on YARN-10707:


Thanks for the patch, [~zhuqi]. To decrease the size of the patch, I think it 
would be better to keep the ResourceUtilization.newInstance method signature 
the same (i.e. with pmem, vmem, and cpu). And then create a new method 
signature with those 3 parameters plus the new custom resources. The 
newInstance method with only 3 parameters can call the method with 4 parameters 
and just assume that the custom resources will be null. That way we won't have 
to modify as many files changing all of the newInstance calls to add null. The 
same logic can be used for {{addTo}} and {{subtractFrom}}
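
A sketch of the suggested overloading (the 4-argument variant and 
setCustomResources are assumed from the patch under review):

{noformat}
public static ResourceUtilization newInstance(int pmem, int vmem, float cpu) {
  return newInstance(pmem, vmem, cpu, null);  // existing callers unchanged
}

public static ResourceUtilization newInstance(int pmem, int vmem, float cpu,
    Map<String, Float> customResources) {
  ResourceUtilization utilization =
      Records.newRecord(ResourceUtilization.class);
  utilization.setPhysicalMemory(pmem);
  utilization.setVirtualMemory(vmem);
  utilization.setCPU(cpu);
  if (customResources != null) {
    utilization.setCustomResources(customResources);
  }
  return utilization;
}
{noformat}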

{noformat}
  public void setCustomResource(String resourceName, float utilization) {
if (customResources != null &&
resourceName != null && !resourceName.isEmpty()) {
  customResources.put(resourceName, utilization);
}
  }
{noformat}
I don't think the {{customResources != null}} check is necessary. 
{{customResources}} is initialized to a new HashMap and the only place that it 
is assigned is in {{setCustomResources}}, but that method only sets it if the 
parameter is non-null. 

{noformat}
+nodeUtilization =
+ResourceUtilization.newInstance(
+(int) (pmem >> 20), // B -> MB
+(int) (vmem >> 20), // B -> MB
+vcores, // Used Virtual Cores
+customResources);  // Used GPUs
+
+nodeUtilization.
+setCustomResource(ResourceInformation.GPU_URI, nodeGpus);
+
+
{noformat}
Maybe it's just me, but I think it makes more sense to set the custom resources 
before passing them as a parameter to newInstance. Just like we're setting cpu 
and mem in newInstance instead of setting them to 0 and then setting them after

> Support custom resources in ResourceUtilization, and update Node GPU 
> Utilization to use.
> 
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch, YARN-10707.004.patch, YARN-10707.005.patch, 
> YARN-10707.006.patch
>
>
> Support GPUs in ResourceUtilization, and update the node GPU utilization to 
> use it first.
> This will be very helpful for other use cases involving GPU utilization.






[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326048#comment-17326048
 ] 

Eric Badger commented on YARN-10743:


I don't really have a big issue with adding this as an option that is disabled 
by default. It's not something that I would ever want to enable in my clusters, 
but if there is use for it in other scenarios, then I don't have a big issue 
with it. [~Jim_Brennan], do you agree or do you still have concerns with adding 
this?

> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch, image-2021-04-20-10-41-01-057.png
>
>
> YARN-10471 added support for killing containers whose logs exceed the 
> container log size limit.
> We should add a policy to skip log aggregation for those containers, so 
> that we reduce the pressure on HDFS, etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]






[jira] [Updated] (YARN-10723) Change CS nodes page in UI to support custom resource.

2021-04-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10723:
---
Fix Version/s: 3.2.3
   3.1.5
   3.3.1
   3.4.0

[~zhuqi], thanks for the patch! +1 I've committed this to trunk (3.4), 
branch-3.3, branch-3.2, and branch-3.1

> Change CS nodes page in UI to support custom resource.
> --
>
> Key: YARN-10723
> URL: https://issues.apache.org/jira/browse/YARN-10723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10723.001.patch, YARN-10723.002.patch, 
> YARN-10723.003.patch, YARN-10723.004.patch, YARN-10723.005.patch, 
> image-2021-04-06-17-22-32-733.png
>
>
> The nodes page currently only supports GPU as a custom resource.
> We should make this work for all custom resources.






[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---
Fix Version/s: 2.10.2
   3.1.5

Thanks for the review, [~Jim_Brennan]. The spotbugs warning is unrelated to 
this patch (it's in a different file, Server.java). I've committed the 2.10 
patch to branch-2.10.

This jira has now been committed to trunk (3.4), branch-3.3, branch-3.2, 
branch-3.1, and branch-2.10

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, 
> YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.serve

[jira] [Commented] (YARN-10715) Remove hardcoded resource values (e.g. GPU/FPGA) in code.

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325387#comment-17325387
 ] 

Eric Badger commented on YARN-10715:


Finally getting around to looking at this and I don't think removing the 
hardcoded values from ResourceUtils is necessary. It's not necessary for the 
resource translation to be there for GPUs and FPGAs, but it also doesn't hurt 
anything. I'm not quite sure why {{yarn.io/gpu}} was chosen anyway, since that 
seems like a pretty complex name for something as simple as a gpu. But I'm sure 
there was a good reason.

Anyway, we should definitely strive to remove any code that is hardcoding 
calculations for GPUs/FPGAs and generalize them to any extended resource type. 
But in this case, this is just a simple translation from {{gpu}} to 
{{yarn.io/gpu}} and similarly for FPGAs. I'm inclined to close this as Won't Do.
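
For anyone following along, the translation in question amounts to a tiny name 
lookup. A minimal sketch with hypothetical class and map names (this is not the 
actual ResourceUtils code):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the shorthand-to-canonical translation
// discussed above. Names are illustrative, not ResourceUtils itself.
public final class ResourceNameCanonicalizer {
  private static final Map<String, String> SHORTHANDS = new HashMap<>();
  static {
    SHORTHANDS.put("gpu", "yarn.io/gpu");
    SHORTHANDS.put("fpga", "yarn.io/fpga");
  }

  // Returns the fully qualified name; unknown or already-qualified
  // names pass through unchanged.
  public static String canonicalize(String name) {
    return SHORTHANDS.getOrDefault(name, name);
  }
}
{code}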

> Remove hardcoded resource values (e.g. GPU/FPGA) in code.
> -
>
> Key: YARN-10715
> URL: https://issues.apache.org/jira/browse/YARN-10715
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10715.001.patch
>
>
> https://issues.apache.org/jira/browse/YARN-10503?focusedCommentId=17307772&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17307772
> As above comment , we should remove hardcoded resource values (e.g. GPU/FPGA) 
> in code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10743) Add a policy for not aggregating for containers which are killed because exceeding container log size limit.

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325343#comment-17325343
 ] 

Eric Badger commented on YARN-10743:


I have the same concern as [~Jim_Brennan]. If the Flink logs are large enough 
that the container is getting killed, don't you want to check the logs to see 
what happened? I'm trying to understand the scenario where you wouldn't want 
the logs even though your container failed due to large log size. Is there a 
reason that you don't care about the logs in this instance?
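
For what it's worth, if we did want this, the shape of it is small. A minimal 
sketch under the assumption that the NM records a distinct exit status for this 
kill reason; the class, method, and exit code below are hypothetical stand-ins, 
not YARN's actual API:

{code:java}
// Illustrative only: skip aggregation for containers killed for
// oversized logs. KILLED_FOR_EXCESS_LOGS is a made-up exit status.
public class SkipOversizedLogsAggregationPolicy {
  private static final int KILLED_FOR_EXCESS_LOGS = -200;

  // Aggregate everything except containers killed for this reason.
  public boolean shouldDoLogAggregation(int containerExitStatus) {
    return containerExitStatus != KILLED_FOR_EXCESS_LOGS;
  }
}
{code}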

> Add a policy for not aggregating for containers which are killed because 
> exceeding container log size limit.
> 
>
> Key: YARN-10743
> URL: https://issues.apache.org/jira/browse/YARN-10743
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10743.001.patch
>
>
> Since YARN-10471 added support for killing containers that exceed the log size limit.
> We had better add a policy that skips aggregation for those containers, so 
> as to reduce the pressure on HDFS etc.
> cc [~epayne] [~Jim_Brennan] [~ebadger]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325329#comment-17325329
 ] 

Eric Badger commented on YARN-10460:


Posting a branch-2.10 patch that doesn't use a lambda expression (not supported 
in Java 7).
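
For reviewers unfamiliar with the constraint: branch-2.10 still builds with 
Java 7, so the lambda in the trunk patch has to become an anonymous inner 
class. A generic illustration (not the patch itself):

{code:java}
public class LambdaVsAnonymous {
  public static void main(String[] args) {
    // Trunk style (Java 8+): lambda expression.
    Runnable java8Style = () -> System.out.println("run");
    // branch-2.10 style (Java 7): equivalent anonymous inner class.
    Runnable java7Style = new Runnable() {
      @Override
      public void run() {
        System.out.println("run");
      }
    };
    java8Style.run();
    java7Style.run();
  }
}
{code}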

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, 
> YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdat

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---
Attachment: YARN-10460-branch-2.10.002.patch

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-2.10.002.patch, 
> YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in 

[jira] [Comment Edited] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325291#comment-17325291
 ] 

Eric Badger edited comment on YARN-10460 at 4/19/21, 8:26 PM:
--

Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to 
branch-3.2 and cherry-picked it to branch-3.1. So now this has been committed 
to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1


was (Author: ebadger):
Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to 
branch-3.2. So now this has been committed to trunk (3.4), branch-3.3, and 
branch-3.2

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagement

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---
Fix Version/s: 3.2.3

Thanks for the review, [~Jim_Brennan]! I've committed the 3.2 patch to 
branch-3.2. So now this has been committed to trunk (3.4), branch-3.3, and 
branch-3.2

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdate

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325231#comment-17325231
 ] 

Eric Badger commented on YARN-10460:


The unit test failures seem unrelated and the tests don't fail for me locally. 
[~pbacsko], [~aajisaka], [~Jim_Brennan], could one of you review the branch-3.2 patch?

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.test

[jira] [Commented] (YARN-10723) Change CS nodes page in UI to support custom resource.

2021-04-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325210#comment-17325210
 ] 

Eric Badger commented on YARN-10723:


Looks like it still never ran. [~zhuqi], can you re-upload the patch? 

> Change CS nodes page in UI to support custom resource.
> --
>
> Key: YARN-10723
> URL: https://issues.apache.org/jira/browse/YARN-10723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10723.001.patch, YARN-10723.002.patch, 
> YARN-10723.003.patch, YARN-10723.004.patch, image-2021-04-06-17-22-32-733.png
>
>
> The node page currently only supports GPU as a custom resource.
> We should make this work for all custom resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10723) Change CS nodes page in UI to support custom resource.

2021-04-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324097#comment-17324097
 ] 

Eric Badger commented on YARN-10723:


Precommit never ran on the latest patch, so I cancelled the patch and 
resubmitted. I also tested out the patch on my GPU environment as well as my 
non-GPU environment and both look good. I'm +1 on the patch pending HadoopQA

> Change CS nodes page in UI to support custom resource.
> --
>
> Key: YARN-10723
> URL: https://issues.apache.org/jira/browse/YARN-10723
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10723.001.patch, YARN-10723.002.patch, 
> YARN-10723.003.patch, YARN-10723.004.patch, image-2021-04-06-17-22-32-733.png
>
>
> The node page currently only supports GPU as a custom resource.
> We should make this work for all custom resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324085#comment-17324085
 ] 

Eric Badger commented on YARN-10460:


Reopening and attaching a patch for branch-3.2 that puts {{clearClientCache}} 
in ProtobufRpcEngine instead of ProtobufRpcEngine2, since ProtobufRpcEngine2 
doesn't exist in branch-3.2
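
As background for anyone reviewing the backports: the failure mode reproduces 
outside Hadoop entirely, because once a ThreadGroup is destroyed, constructing 
any new thread in it throws. A minimal standalone sketch of my own, not part 
of the patch:

{code:java}
public class ThreadGroupDestroyRepro {
  public static void main(String[] args) throws Exception {
    ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
    Thread t = new Thread(group, new Runnable() {
      @Override
      public void run() { }
    }, "Time-limited test");
    t.start();
    t.join();                 // thread finishes and leaves the group
    group.destroy();          // what JUnit 4.13 does in its finally block
    // Any cached executor spinning up a thread in this group now fails:
    new Thread(group, new Runnable() {
      @Override
      public void run() { }
    });                       // throws IllegalThreadStateException
  }
}
{code}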

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nod

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---
Attachment: YARN-10460-branch-3.2.002.patch

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> c

[jira] [Reopened] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reopened YARN-10460:


> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch, YARN-10460-branch-3.2.002.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{Protob

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---

I backported this to branch-3.3. There's a merge conflict with branch-3.2 that 
I'm looking into. HADOOP-17602 was fairly recently merged, which means this 
issue shows up in all active branches without this fix.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(T

[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2021-04-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10460:
---
Fix Version/s: 3.3.1

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngin

[jira] [Updated] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-04-09 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10503:
---
Fix Version/s: 3.3.1

Thanks for the patch, [~zhuqi]. +1 committed to branch-3.3. This has now been 
committed to trunk (3.4) and branch-3.3.
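
For readers skimming the thread: with this change an absolute-capacity spec can 
carry custom types alongside memory and vcores, e.g. 
{{[memory=10240,vcores=12,yarn.io/gpu=4]}}. A toy parser sketch to show the 
shape of such a spec (illustrative only, not the CapacityScheduler code):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Toy parser for specs like "[memory=10240,vcores=12,yarn.io/gpu=4]".
// Illustrative only; not the CapacityScheduler implementation.
public final class AbsoluteResourceSpecParser {
  public static Map<String, Long> parse(String spec) {
    String body = spec.trim();
    if (body.startsWith("[") && body.endsWith("]")) {
      body = body.substring(1, body.length() - 1);
    }
    Map<String, Long> values = new LinkedHashMap<>();
    for (String pair : body.split(",")) {
      String[] kv = pair.split("=", 2);
      values.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
    }
    return values;
  }

  public static void main(String[] args) {
    // Prints {memory=10240, vcores=12, yarn.io/gpu=4}
    System.out.println(parse("[memory=10240,vcores=12,yarn.io/gpu=4]"));
  }
}
{code}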

> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10503-branch-3.3.010.patch, YARN-10503.001.patch, 
> YARN-10503.002.patch, YARN-10503.003.patch, YARN-10503.004.patch, 
> YARN-10503.005.patch, YARN-10503.006.patch, YARN-10503.007.patch, 
> YARN-10503.008.patch, YARN-10503.009.patch, YARN-10503.010.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands for 
> different resourceTypes.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-04-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10503:
---
Fix Version/s: 3.4.0

Thanks for the updates, [~zhuqi]. +1 on patch 10. And thanks for the reviews, 
[~gandras] and [~pbacsko]. I've committed this to trunk (3.4)

[~zhuqi], there is a conflict on the cherry-pick back to branch-3.3. It looks 
like a fairly trivial fix. Could you make the necessary adjustments and put up 
a patch for branch-3.3?

> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, 
> YARN-10503.006.patch, YARN-10503.007.patch, YARN-10503.008.patch, 
> YARN-10503.009.patch, YARN-10503.010.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands for 
> different resourceTypes.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10702:
---
Fix Version/s: 3.2.3
   3.1.5

Thanks for the additional patches, [~Jim_Brennan]. I committed the 3.2 and 3.1 
patches. This has now been committed to trunk (3.4), branch-3.3, branch-3.2, 
and branch-3.1.
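
For anyone curious how such a metric can be derived, per-thread CPU time is 
available from the JVM; a rough sketch using ThreadMXBean (illustrative, not 
the actual committed code):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Rough sketch: sample a thread's CPU time over a wall-clock window
// to get a "busy" fraction. Illustrative, not the committed patch.
public class EventProcessorCpuSampler {
  private final ThreadMXBean mx = ManagementFactory.getThreadMXBean();

  // Returns CPU utilization of the thread over windowMillis as a
  // fraction in [0, 1]; getThreadCpuTime returns nanoseconds.
  public double sample(long threadId, long windowMillis)
      throws InterruptedException {
    long before = mx.getThreadCpuTime(threadId);
    Thread.sleep(windowMillis);
    long after = mx.getThreadCpuTime(threadId);
    return (after - before) / (windowMillis * 1_000_000.0);
  }
}
{code}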

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.1.006.patch, 
> YARN-10702-branch-3.2.006.patch, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the CPU usage of the ResourceManager Event 
> Processing thread. This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.
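For illustration, a minimal sketch of how a thread's CPU usage could be sampled 
with {{ThreadMXBean}} (class and field names here are assumptions for the 
sketch, not the committed implementation):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch only: measures the fraction of wall-clock time a given thread
// (e.g. the RM event-processing thread) spent on-CPU between calls.
public class ThreadCpuSampler {
  private final ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
  private final long threadId;
  private long lastCpuNanos;
  private long lastWallNanos;

  public ThreadCpuSampler(long threadId) {
    this.threadId = threadId;
    // getThreadCpuTime returns -1 if CPU timing is unsupported/disabled.
    this.lastCpuNanos = mxBean.getThreadCpuTime(threadId);
    this.lastWallNanos = System.nanoTime();
  }

  /** Returns the CPU-busy fraction (0.0 to 1.0) since the previous call. */
  public double sample() {
    long cpuNanos = mxBean.getThreadCpuTime(threadId);
    long wallNanos = System.nanoTime();
    double busy = (double) (cpuNanos - lastCpuNanos)
        / Math.max(1L, wallNanos - lastWallNanos);
    lastCpuNanos = cpuNanos;
    lastWallNanos = wallNanos;
    return busy;
  }
}
{code}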



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-06 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10702:
---
Fix Version/s: 3.3.1
   3.4.0

Thanks for the patch, [~Jim_Brennan]. I've committed it to branch-3.3, so now 
it's been committed to trunk (3.4) and branch-3.3. There's another conflict 
with branch-3.2. If you'd like it to go back there, please provide a patch for 
that branch as well.

Also, a belated thanks to [~gandras] and [~zhuqi] for the reviews on the 
original patch.

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: Scheduler-Busy.png, YARN-10702-branch-3.3.006.patch, 
> YARN-10702.001.patch, YARN-10702.002.patch, YARN-10702.003.patch, 
> YARN-10702.004.patch, YARN-10702.005.patch, YARN-10702.006.patch, 
> simon-scheduler-busy.png
>
>
> Add a cluster metric to track the CPU usage of the ResourceManager Event 
> Processing thread. This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-04-05 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315191#comment-17315191
 ] 

Eric Badger commented on YARN-10702:


[~Jim_Brennan], thanks for the patch. +1. I've committed this to trunk (3.4). 
There are a few small conflicts with the cherry-pick to branch-3.3. Would you 
mind putting up a patch for branch-3.3?

> Add cluster metric for amount of CPU used by RM Event Processor
> ---
>
> Key: YARN-10702
> URL: https://issues.apache.org/jira/browse/YARN-10702
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: Scheduler-Busy.png, YARN-10702.001.patch, 
> YARN-10702.002.patch, YARN-10702.003.patch, YARN-10702.004.patch, 
> YARN-10702.005.patch, YARN-10702.006.patch, simon-scheduler-busy.png
>
>
> Add a cluster metric to track the CPU usage of the ResourceManager Event 
> Processing thread. This lets us know when the critical path of the RM is 
> running out of headroom.
> This feature was originally added for us internally by [~nroberts] and we've 
> been running with it on production clusters for nearly four years.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-29 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10501:
---
Fix Version/s: 2.10.2

Thanks for the patch/patience, [~caozhiqiang]. Finally HadoopQA is back to 
normal. I fixed up the small checkstyle issue on the patch and committed it to 
branch-2.10.

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, 
> YARN-10502-branch-2.10.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or when using the 
> WILDCARD_PORT (0) port, it can't remove all of the label info from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info, and when 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed.
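For illustration only, a sketch of one way the REPLACE case could guard against 
the null (assuming the old host-level labels are what needs to be removed; the 
committed patch may take a different approach):

{code:java}
case REPLACE:
  replaceNodeForLabels(nodeId, host.labels, labels);
  replaceLabelsForNode(nodeId, host.labels, labels);
  // Sketch: capture the old host labels before they are overwritten, so
  // they can still be removed below even when node.labels is null.
  Set<String> oldHostLabels = new HashSet<>(host.labels);
  host.labels.clear();
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
    Set<String> oldLabels =
        (node.labels != null) ? node.labels : oldHostLabels;
    replaceNodeForLabels(node.nodeId, oldLabels, labels);
    node.labels = null;
  }
  break;
{code}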



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-03-26 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309788#comment-17309788
 ] 

Eric Badger commented on YARN-10503:


Thanks for the update, [~zhuqi]. This might be a little too picky, but I think 
it would be better if {{appendCustomResources}} just created the string instead 
of appending it to {{resourceString}}. That way we can keep the current 
structure of the StringBuilders at the caller level.

{noformat}
resourceString
.append("[" + AbsoluteResourceType.MEMORY.toString().toLowerCase() + "="
+ resource.getMemorySize() + ","
+ AbsoluteResourceType.VCORES.toString().toLowerCase() + "="
+ resource.getVirtualCores()
+ getCustomResourcesString(resource) + "]");
{noformat}
It could look something like this, where {{getCustomResourcesString}} returns 
the string instead of appending it. 
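A sketch of what that method body could look like, built from the snippet in 
the patch (illustrative only):

{noformat}
private String getCustomResourcesString(Resource resource) {
  StringBuilder sb = new StringBuilder();
  if (ResourceUtils.getNumberOfKnownResourceTypes() > 2) {
    // Indices 0 and 1 are memory and vcores; the rest are custom types.
    ResourceInformation[] resources = resource.getResources();
    for (int i = 2; i < resources.length; i++) {
      ResourceInformation resInfo = resources[i];
      sb.append("," + resInfo.getName() + "=" + resInfo.getValue());
    }
  }
  return sb.toString();
}
{noformat}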

{noformat}
// Custom resource type defined by user.
// Such as GPU FPGA etc.
if (!resourceTypes.contains(resourceName)) {
  resource.setResourceInformation(resourceName, ResourceInformation
  .newInstance(resourceName, units, resourceValue));
  return;
}

// map it based on key.
AbsoluteResourceType resType = AbsoluteResourceType
.valueOf(StringUtils.toUpperCase(resourceName));
switch (resType) {
case MEMORY :
  resource.setMemorySize(resourceValue);
  break;
case VCORES :
  resource.setVirtualCores(resourceValue.intValue());
  break;
default :
  resource.setResourceInformation(resourceName, ResourceInformation
  .newInstance(resourceName, units, resourceValue));
  break;
}
  }
{noformat}
This snippet of code confuses me a bit. What's the purpose of the initial if 
statement? If the resource doesn't already contain the resource in question, we 
add it and then return. But in the case that it does exist, we go to the switch 
statement, add it, and then return. It looks like the if statement is 
unnecessary. Am I missing something?

> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, 
> YARN-10503.006.patch, YARN-10503.007.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling with different resourceType 
> absolute demands.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-26 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309640#comment-17309640
 ] 

Eric Badger commented on YARN-10501:


bq. Backporting HADOOP-16870 to branch-2.10 should mitigate this error. I'll 
check the patch there.
Gotcha. Thanks, [~aajisaka]. We'll resubmit the patch once HADOOP-16870 is 
backported.

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch
>
>
> When adding a label to nodes without a nodemanager port, or when using the 
> WILDCARD_PORT (0) port, it can't remove all of the label info from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info, and when 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-03-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309058#comment-17309058
 ] 

Eric Badger commented on YARN-10503:


Thanks for the patch, [~zhuqi]! Here are a few comments:

{noformat}
+if (ResourceUtils.getNumberOfKnownResourceTypes() > 2) {
+  ResourceInformation[] resources =
+  resource.getResources();
+  for (int i = 2; i < resources.length; i++) {
+ResourceInformation resInfo = resources[i];
+resourceString.append(","
++ resInfo.getName() + "=" + resInfo.getValue());
+  }
+}
{noformat}
This code snippet is repeated many times in this patch. I think it would make 
sense to extract it into a method so that we don't have so much code 
repetition, as sketched below.
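For example, it could be pulled into a helper along these lines (a sketch, not 
the actual patch code):

{noformat}
private void appendCustomResources(StringBuilder resourceString,
    Resource resource) {
  if (ResourceUtils.getNumberOfKnownResourceTypes() > 2) {
    // Indices 0 and 1 are memory and vcores; the rest are custom types.
    ResourceInformation[] resources = resource.getResources();
    for (int i = 2; i < resources.length; i++) {
      ResourceInformation resInfo = resources[i];
      resourceString.append(","
          + resInfo.getName() + "=" + resInfo.getValue());
    }
  }
}
{noformat}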

{{splits[0]}} is used enough in the code that I think it makes sense to make it 
into a local variable for better readability.


> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch, YARN-10503.004.patch, YARN-10503.005.patch, 
> YARN-10503.006.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling with different resourceType 
> absolute demands.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309028#comment-17309028
 ] 

Eric Badger commented on YARN-10501:


[~aajisaka], can you help out here? The Yetus bug is blocking this patch.

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch
>
>
> When adding a label to nodes without a nodemanager port, or when using the 
> WILDCARD_PORT (0) port, it can't remove all of the label info from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info, and when 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10713) ClusterMetrics should support custom resource capacity related metrics.

2021-03-25 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10713:
---
Fix Version/s: 3.3.1
   3.4.0

Thanks for the patch, [~zhuqi]. I tested this out on my local GPU environment 
and everything looks good. +1 I've committed this to trunk (3.4) and 
branch-3.3. The cherry-pick comes back clean to branch-3.2, but there is a 
compilation error that I believe is due to some other requisite patches not 
being pulled back there. If you'd like it to go back to branch-3.2, we'll need 
to do some additional work. Closing for now, though. 

> ClusterMetrics should support custom resource capacity related metrics.
> ---
>
> Key: YARN-10713
> URL: https://issues.apache.org/jira/browse/YARN-10713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10713.001.patch, YARN-10713.002.patch
>
>
> YARN-10688
> Only add gpu resource capacity related metrics, i think we should improve it 
> to support custom resources as [~ebadger] suggested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10713) ClusterMetrics should support custom resource capacity related metrics.

2021-03-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308817#comment-17308817
 ] 

Eric Badger commented on YARN-10713:


[~zhuqi], I very much appreciate the patches and am trying to review them as 
quickly as possible, but the number of different patches in flight concurrently 
is quite overwhelming. I will do my best to review them in a timely manner.

> ClusterMetrics should support custom resource capacity related metrics.
> ---
>
> Key: YARN-10713
> URL: https://issues.apache.org/jira/browse/YARN-10713
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10713.001.patch, YARN-10713.002.patch
>
>
> YARN-10688
> Only add gpu resource capacity related metrics, i think we should improve it 
> to support custom resources as [~ebadger] suggested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308187#comment-17308187
 ] 

Eric Badger commented on YARN-10503:


I'm fine with moving the effort of removing hardcoded resource values (e.g. 
GPU/FPGA) to a follow-up JIRA, but only if that JIRA is actually going to be 
worked on, because right now we are adding code debt with every hardcoded value 
we add to the code.

> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch, YARN-10503.004.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling with different resourceType 
> absolute demands.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308180#comment-17308180
 ] 

Eric Badger commented on YARN-10493:


I did have a weird umask set. Reverting to the default umask fixed the 
localizer errors that I posted above. However, the tool should probably 
explicitly specify 755 and 644 perms for the directories and files, 
respectively.
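For example, the import tool could pin the permissions explicitly instead of 
inheriting the client's umask (a sketch, not the actual tool code):

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Sketch only: force 755 on directories and 644 on files after upload,
// regardless of what umask the client happens to have.
void pinPermissions(FileSystem fs, Path dir, Path file) throws IOException {
  fs.setPermission(dir, new FsPermission((short) 0755));
  fs.setPermission(file, new FsPermission((short) 0644));
}
{noformat}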

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308173#comment-17308173
 ] 

Eric Badger commented on YARN-10493:


Hmm, must be a default umask issue or something on my testing environment. 
After fixing the perms and a small container-executor.cfg issue, I've been able 
to successfully run a sleep job using the V2 plugins!

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308169#comment-17308169
 ] 

Eric Badger commented on YARN-10493:


{noformat}
[ebadger@foo hadoop]$ hadoop fs -ls /runc-root/*/*/*
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of 
HADOOP_PREFIX.
-rw---  10 ebadger supergroup  11166 2021-03-24 00:06 
/runc-root/config/a9/a9a241e617577cf0da93c89010d0026de8327c8220c732f2ede29d2ce15588cf
-rw---  10 ebadger supergroup   4096 2021-03-24 00:06 
/runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh
-rw---  10 ebadger supergroup156 2021-03-24 00:06 
/runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.tar.gz
-rw---  10 ebadger supergroup   4096 2021-03-24 00:06 
/runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh
-rw---  10 ebadger supergroup185 2021-03-24 00:06 
/runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.tar.gz
-rw---  10 ebadger supergroup   26095616 2021-03-24 00:06 
/runc-root/layer/72/726141ff510fe8ee7d540faa490649332a561f79ce9b5d02045f7e0db5e4cfbc.sqsh
-rw---  10 ebadger supergroup   26687036 2021-03-24 00:06 
/runc-root/layer/72/726141ff510fe8ee7d540faa490649332a561f79ce9b5d02045f7e0db5e4cfbc.tar.gz
-rw---  10 ebadger supergroup  12288 2021-03-24 00:06 
/runc-root/layer/8c/8c4f37442a65ac28bb23c8a0c408f1f2c061b8928abfaf8a40050ebae6130974.sqsh
-rw---  10 ebadger supergroup  12057 2021-03-24 00:06 
/runc-root/layer/8c/8c4f37442a65ac28bb23c8a0c408f1f2c061b8928abfaf8a40050ebae6130974.tar.gz
-rw---  10 ebadger supergroup   98697216 2021-03-24 00:06 
/runc-root/layer/98/98dc2361422a32ef978770b879dd1d0079242cc55980cfd2205939a6796d309f.sqsh
-rw---  10 ebadger supergroup   99451719 2021-03-24 00:06 
/runc-root/layer/98/98dc2361422a32ef978770b879dd1d0079242cc55980cfd2205939a6796d309f.tar.gz
-rw---  10 ebadger supergroup  121638912 2021-03-24 00:06 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh
-rw---  10 ebadger supergroup  123724262 2021-03-24 00:06 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.tar.gz
-rw---  10 ebadger supergroup  205000704 2021-03-24 00:06 
/runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh
-rw---  10 ebadger supergroup  205322058 2021-03-24 00:06 
/runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.tar.gz
-rw---  10 ebadger supergroup   1791 2021-03-24 00:06 
/runc-root/manifest/f8/f849453b22d5e6a2e2f1390dc021cd2a786bcd923fffa9e778f3be6c87a0d3fe
-rw---  10 ebadger supergroup236 2021-03-24 00:06 
/runc-root/meta/hadoop/rhel7@current.properties
-rw---  10 ebadger supergroup236 2021-03-24 00:15 
/runc-root/meta/hadoop/rhel7@latest.properties
-rw---  10 ebadger supergroup236 2021-03-24 20:21 
/runc-root/meta/library/rhel7@current.properties
{noformat}
For reference, the perms on all of the files are 600.

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308167#comment-17308167
 ] 

Eric Badger commented on YARN-10493:


{noformat}
2021-03-24 20:21:56,225 WARN  [Public Localizer] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED 
event for request { 
/runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh,
 1616544364679, FILE, null } but localized resource is missing
2021-03-24 20:21:56,226 ERROR [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(1004)) - Failed to download resource { { 
/runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh,
 1616544364679, FILE, null },pending,[],50806868248172187,DOWNLOADING} 
java.io.IOException: Resource 
/runc-root/layer/f6/f64d0ed7e0de1a2c56f69eaae5cfe8351b23cc2d600cc226557881862a5bdf5f.sqsh
 is not publicly accessible and as such cannot be part of the public cache.
2021-03-24 20:21:56,227 WARN  [Public Localizer] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED 
event for request { 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh,
 1616544363068, FILE, null } but localized resource is missing
2021-03-24 20:21:56,227 ERROR [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(1004)) - Failed to download resource { { 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh,
 1616544363068, FILE, null },pending,[],50806868248209381,DOWNLOADING} 
java.io.IOException: Resource 
/runc-root/layer/d5/d5eed6e7343938620cc22b9b2db0d6a77e8de76ed1387e7b7288d034fd86a92d.sqsh
 is not publicly accessible and as such cannot be part of the public cache.
2021-03-24 20:21:56,227 WARN  [Public Localizer] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED 
event for request { 
/runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh,
 1616544367971, FILE, null } but localized resource is missing
2021-03-24 20:21:56,227 ERROR [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(1004)) - Failed to download resource { { 
/runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh,
 1616544367971, FILE, null },pending,[],50806868248366695,DOWNLOADING} 
java.io.IOException: Resource 
/runc-root/layer/30/30e0953848a62b1477a41e6c09d2f781ace3e6cfb215f01be9dacabdd9480d90.sqsh
 is not publicly accessible and as such cannot be part of the public cache.
2021-03-24 20:21:56,228 WARN  [Public Localizer] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED 
event for request { 
/runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh,
 1616544367125, FILE, null } but localized resource is missing
2021-03-24 20:21:56,228 ERROR [Public Localizer] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(1004)) - Failed to download resource { { 
/runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh,
 1616544367125, FILE, null },pending,[],50806868248376369,DOWNLOADING} 
java.io.IOException: Resource 
/runc-root/layer/71/71a0a977369767160e5ed5f8ab279558240eb19efa880725049c4651411c2225.sqsh
 is not publicly accessible and as such cannot be part of the public cache.
2021-03-24 20:21:56,228 WARN  [Public Localizer] 
localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(184)) - Received LOCALIZATION_FAILED 
event for request { 
/runc-root/config/a9/a9a241e617577cf0da93c89010d0026de8327c8220c732f2ede29d2ce15588cf,
 1616544368404, FILE, null } but localized resource is missing
{noformat}
Building and running with an image under the default "library" namespace, I ran 
into these permission errors. The errors are pretty clear, but is this 
something that you run into in your production environments? Is there a setup 
step that is necessary before or after running the CLI tool to fix the perms on 
the files?

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc

[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308165#comment-17308165
 ] 

Eric Badger commented on YARN-10493:


Yea, I think that would be a good improvement to the plugin implementation.

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307481#comment-17307481
 ] 

Eric Badger commented on YARN-10493:


Additionally, I've run into some issues while testing.

{noformat:title=CLI Invocation}
hadoop jar ./hadoop-tools/hadoop-runc/target/hadoop-runc-3.4.0-SNAPSHOT.jar  
org.apache.hadoop.runc.tools.ImportDockerImage -r docker.foobar.com: 
hadoop-images/hadoop/rhel7 hadoop/rhel7
{noformat}

{noformat}
[ebadger@foo hadoop]$ hadoop fs -ls 
/runc-root/meta/hadoop/rhel7@latest.properties
-rw---  10 ebadger supergroup236 2021-03-24 00:15 
/runc-root/meta/hadoop/rhel7@latest.properties
{noformat}
Here's the properties file after the CLI tool completes.

{noformat}
<property>
  <name>yarn.nodemanager.runtime.linux.runc.image-tag-to-manifest-plugin</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.runc.ImageTagToManifestV2Plugin</value>
</property>

<property>
  <name>yarn.nodemanager.runtime.linux.runc.manifest-to-resources-plugin</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.runc.HdfsManifestToResourcesV2Plugin</value>
</property>
{noformat}
Then I set these properties as well as adding {{runc}} to the allowed-runtimes 
config.

{noformat}
export 
vars="YARN_CONTAINER_RUNTIME_TYPE=runc,YARN_CONTAINER_RUNTIME_RUNC_IMAGE=hadoop/rhel7";
 $HADOOP_HOME/bin/hadoop jar 
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.*-tests.jar
 sleep -Dyarn.app.mapreduce.am.env="HADOOP_MAPRED_HOME=$HADOOP_HOME" 
-Dmapreduce.admin.user.env="HADOOP_MAPRED_HOME=$HADOOP_HOME" 
-Dyarn.app.mapreduce.am.env=$vars -Dmapreduce.map.env=$vars 
-Dmapreduce.reduce.env=$vars -mt 1 -rt 1 -m 1 -r 1
{noformat}
I ran a sleep job using this command.

{noformat}
2021-03-24 00:26:07,823 DEBUG [NM ContainerManager dispatcher] 
runc.ImageTagToManifestV2Plugin 
(ImageTagToManifestV2Plugin.java:getHdfsImageToHashReader(144)) - Checking HDFS 
for image file: /runc-root/meta/library/hadoop/rhel7@latest.properties
2021-03-24 00:26:07,825 WARN  [NM ContainerManager dispatcher] 
runc.ImageTagToManifestV2Plugin 
(ImageTagToManifestV2Plugin.java:getHdfsImageToHashReader(148)) - Did not load 
the hdfs image to hash properties file, file doesn't exist
2021-03-24 00:26:07,828 WARN  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:transition(1261)) - Failed to parse 
resource-request
java.io.FileNotFoundException: File does not exist: 
/runc-root/manifest/ha/hadoop/rhel7
{noformat}
Then I got this error in the NM when it was trying to resolve the tag. It added 
the default {{metaNamespaceDir}} (which is {{library}}) into the path when 
looking for the properties file. But when the CLI tool ran, it didn't add the 
{{metaNamespaceDir}}. I didn't have the config set in my configs at all, so the 
NM was using the conf default.

I'm not sure if I did anything wrong here or not, but it seems inconsistent to 
me. Let me know what you think.

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307467#comment-17307467
 ] 

Eric Badger commented on YARN-10493:


[~MatthewSharp], thanks for the PR. Just starting to take a look at this now. 
I am wondering if the document is still up to date, though. Is the PR you put 
up still a good reflection of what's in the document? Just want to make sure.

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307457#comment-17307457
 ] 

Eric Badger commented on YARN-10517:


[~epayne], this change looks reasonable to me, but I'd like to get an extra 
pair of eyes on it since it has to do with scheduler internals.

> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> --
>
> Key: YARN-10517
> URL: https://issues.apache.org/jira/browse/YARN-10517
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.0, 3.3.0
>Reporter: sibyl.lv
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10517-branch-3.2.001.patch, YARN-10517.001.patch, 
> wrong metrics.png
>
>
> After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still has 
> incorrect allocated jmx, such as  {color:#660e7a}allocatedMB, 
> {color}{color:#660e7a}allocatedVCores and 
> {color}{color:#660e7a}allocatedContainers, {color}when the node partition is 
> updated from "DEFAULT" to other label and there are  running applications.
> Steps to reproduce
> ==
>  # Configure capacity-scheduler.xml with label configuration
>  # Submit one application to default partition and run
>  # Add label "tpcds" to cluster and replace label on node1 and node2 to be 
> "tpcds" when the above application is running
>  # Note down "VCores Used" at Web UI
>  # When the application is finished, the metrics get wrong (screenshots 
> attached).
> ==
>  
> FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles 
> this event {color:#660e7a}NODE_LABELS_UPDATE.{color}
> So we should release container resource from old partition and add used 
> resource to new partition, just as updating queueUsage.
> {code:java}
> // code placeholder
> public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
> String newPartition) {
>   Resource containerResource = rmContainer.getAllocatedResource();
>   this.attemptResourceUsage.decUsed(oldPartition, containerResource);
>   this.attemptResourceUsage.incUsed(newPartition, containerResource);
>   getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
>   getCSLeafQueue().incUsedResource(newPartition, containerResource, this);
>   // Update new partition name if container is AM and also update AM resource
>   if (rmContainer.isAMContainer()) {
> setAppAMNodePartitionName(newPartition);
> this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
> this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
> getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
> getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10707) Support gpu in ResourceUtilization, and update Node GPU Utilization to use.

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307427#comment-17307427
 ] 

Eric Badger commented on YARN-10707:


Similar to my 
[comment|https://issues.apache.org/jira/browse/YARN-10503?focusedCommentId=17307421&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17307421]
 on YARN-10503, I believe that the approach we take here should allow for 
arbitrary resources and not be hardcoded for GPUs. It's a lot of work to make 
GPUs a first-class resource, but it should take only a little more work to make 
arbitrary resources (which include GPUs) first-class.

> Support gpu in ResourceUtilization, and update Node GPU Utilization to use.
> ---
>
> Key: YARN-10707
> URL: https://issues.apache.org/jira/browse/YARN-10707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10707.001.patch, YARN-10707.002.patch, 
> YARN-10707.003.patch
>
>
> Support gpu in ResourceUtilization, and update Node GPU Utilization to use 
> first.
> It will be very helpful for other use cases about GPU utilization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with custom resourceType.

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307421#comment-17307421
 ] 

Eric Badger commented on YARN-10503:


bq. Do we want to treat GPUs and FPGAs like that? In other parts of the code, 
we have mem/vcore as primary resources, then an array of other resources. 
I believe the correct approach is to leave memory and vcores as "first class" 
resources and then add logic to support arbitrary extended resources, such as 
GPU or FPGA. The arbitrary extended resources should not be hardcoded values. 
The point is that we're doing the work right now to support GPUs, but in two 
years, if some new resource needs to be tracked and used, we don't want to have 
to redo all of this work again. We should make sure that the work we do here 
extends to any future arbitrary resources.

> Support queue capacity in terms of absolute resources with custom 
> resourceType.
> ---
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling with different resourceType 
> absolute demands.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9618) NodeListManager event improvement

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413
 ] 

Eric Badger edited comment on YARN-9618 at 3/23/21, 8:52 PM:
-

bq. Actually, why we use an other async dispatcher here is try to make the 
rmDispatcher#eventQueue not boom to affect other event process. The boom will 
transformed to nodeListManagerDispatcher#eventQueue.
I think [~gandras]'s point is that all of the events are going to go through 
{{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will 
get the event in the eventQueue and will also do the processing. With this 
proposed change, {{rmDispatcher}} will get the event and then it will copy it 
over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will 
do the processing. But in both cases, {{rmDispatcher}} is dealing with 
{{RMAppNodeUpdateEvent}} in some way. 

So the question is whether copying the event or processing the event takes more 
time. If copying the event takes more time than processing the event, then this 
change only makes things worse. If processing the event takes more time than 
copying the event to the new async dispatcher, then this change makes sense and 
will remove some load on the {{rmDispatcher}}.

[~gandras], is that right?
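For concreteness, a rough sketch of the two-dispatcher hand-off being discussed 
(names are approximations for illustration, not the actual patch):

{noformat}
// The rmDispatcher thread only re-enqueues one event per running app;
// the heavy processing happens on the secondary dispatcher's thread.
AsyncDispatcher nodeListManagerDispatcher = new AsyncDispatcher();
nodeListManagerDispatcher.init(conf);
nodeListManagerDispatcher.register(RMAppEventType.class,
    new ResourceManager.ApplicationEventDispatcher(rmContext));
nodeListManagerDispatcher.start();

// Inside the handler for a node-usable/unusable event, the "copy" is
// just re-enqueueing onto the secondary dispatcher's queue:
for (RMApp app : rmContext.getRMApps().values()) {
  nodeListManagerDispatcher.getEventHandler().handle(
      new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
          updateType));
}
{noformat}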


was (Author: ebadger):
bq. Actually, why we use an other async dispatcher here is try to make the 
rmDispatcher#eventQueue not boom to affect other event process. The boom will 
transformed to nodeListManagerDispatcher#eventQueue.
I think [~gandras]'s point is that all of the events are going to go through 
{{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will 
get the event in the eventQueue and will also do the processing. With this 
proposed change, {{rmDispatcher}} will get the event and then it will copy it 
over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will 
do the processing. But in both cases, {{rmDispatcher}} is dealing with 
{{RMAppNodeUpdateEvent}}s in some way. 

So the question is whether copying the event or processing the event takes more 
time. If copying the event takes more time than processing the event, then this 
change only makes things worse. If processing the event takes more time than 
copying the event to the new async dispatcher, then this change makes sense and 
will remove some load on the {{rmDispatcher}}.

[~gandras], is that right?

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch
>
>
> In the current implementation, NodeListManager events block the async 
> dispatcher, which can cause RM crashes and slow down event processing.
> # On a cluster restart with 1K running apps, each node-usable event will 
> create 1K events, so overall there could be 5K*1K events on a 5K-node cluster.
> # Event processing is blocked until new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event 
> handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413
 ] 

Eric Badger commented on YARN-9618:
---

bq. Actually, why we use an other async dispatcher here is try to make the 
rmDispatcher#eventQueue not boom to affect other event process. The boom will 
transformed to nodeListManagerDispatcher#eventQueue.
I think [~gandras]'s point is that all of the events are going to go through 
{{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will 
get the event in the eventQueue and will also do the processing. With this 
proposed change, {{rmDispatcher}} will get the event and then it will copy it 
over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will 
do the processing. But in both cases, {{rmDispatcher}} is dealing with 
{{RMAppNodeUpdateEvent}}s in some way. 

So the question is whether copying the event or processing the event takes more 
time. If copying the event takes more time than processing the event, then this 
change only makes things worse. If processing the event takes more time than 
copying the event to the new async dispatcher, then this change makes sense and 
will remove some load on the {{rmDispatcher}}.

[~gandras], is that right?

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch
>
>
> In the current implementation, NodeListManager events block the async 
> dispatcher, which can cause RM crashes and slow down event processing.
> # On a cluster restart with 1K running apps, each node-usable event will 
> create 1K events, so overall there could be 5K*1K events on a 5K-node cluster.
> # Event processing is blocked until new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event 
> handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.

2021-03-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307379#comment-17307379
 ] 

Eric Badger commented on YARN-10704:


I'm not very familiar with the new YARN UI v2. Will this change automatically 
apply to both UIs? Or do we need to add extra stuff for it to be supported in 
both?

> The CS effective capacity for absolute mode in UI should support GPU and 
> other custom resources.
> 
>
> Key: YARN-10704
> URL: https://issues.apache.org/jira/browse/YARN-10704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10704.001.patch, YARN-10704.002.patch, 
> YARN-10704.003.patch, image-2021-03-19-12-05-28-412.png, 
> image-2021-03-19-12-08-35-273.png
>
>
> Actually, there is no information about the effective GPU capacity in the UI 
> for absolute resource mode.
> !image-2021-03-19-12-05-28-412.png|width=873,height=136!
> But we have this information in QueueMetrics:
> !image-2021-03-19-12-08-35-273.png|width=613,height=268!
>  
> It's very important for our GPU users running in absolute mode; there is 
> still no way to see absolute GPU information in the CS Queue UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10701) The yarn.resource-types should support multi types without trimmed.

2021-03-18 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10701:
---
Fix Version/s: 3.3.1
   3.4.0

+1. Thanks for the patch, [~zhuqi]. I've committed this to trunk (3.4) and 
branch-3.3

> The yarn.resource-types should support multi types without trimmed.
> ---
>
> Key: YARN-10701
> URL: https://issues.apache.org/jira/browse/YARN-10701
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10701.001.patch, YARN-10701.002.patch
>
>
> {code:java}
> 
>  
>  yarn.resource-types
>  yarn.io/gpu, yarn.io/fpga
>  
>  {code}
>  When I configured the resource types above with GPU and FPGA, this error 
> happened:
>  
> {code:java}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: ' yarn.io/fpga' is 
> not a valid resource name. A valid resource name must begin with a letter and 
> contain only letters, numbers, and any of: '.', '_', or '-'. A valid resource 
> name may also be optionally preceded by a name space followed by a slash. A 
> valid name space consists of period-separated groups of letters, numbers, and 
> dashes.{code}
>   
>  The resource type names should be trimmed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10616) Nodemanagers cannot detect GPU failures

2021-03-18 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304456#comment-17304456
 ] 

Eric Badger edited comment on YARN-10616 at 3/18/21, 9:22 PM:
--

The issue with graceful decommissioning is that you have to edit a file on the 
RM. It would be nice to be able to run a {{yarn rmadmin}} command from a remote 
host to tell the RM to graceful decom a node. AFAIK that functionality doesn't 
exist. 

I still don't like the idea of completely undermining {{-updateNodeResource}}. 
I think I would be more on board with a feature that is disabled by default, 
but can be enabled. That way we won't break any existing ways of doing things, 
but will give more flexibility to those who want to detect these types of 
failures. They will just have to understand that it isn't compatible with 
{{-updateNodeResource}}


was (Author: ebadger):
The issue with graceful decommissioning is that you have to edit a file on the 
RM. It would be nice to be able to run a `yarn rmadmin` command from a remote 
host to tell the RM to graceful decom a node. AFAIK that functionality doesn't 
exist. 

I still don't like the idea of completely undermining {{-updateNodeResource}}. 
I think I would be more on board with a feature that is disabled by default, 
but can be enabled. That way we won't break any existing ways of doing things, 
but will give more flexibility to those who want to detect these types of 
failures. They will just have to understand that it isn't compatible with 
{{-updateNodeResource}}

> Nodemanagers cannot detect GPU failures
> ---
>
> Key: YARN-10616
> URL: https://issues.apache.org/jira/browse/YARN-10616
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node who 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in this case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that would give potentially unexpected behavior in relation to nodes 
> properly auto-detecting failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. Would like to hear thoughts of others on how best to approach this
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures

2021-03-18 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304456#comment-17304456
 ] 

Eric Badger commented on YARN-10616:


The issue with graceful decommissioning is that you have to edit a file on the 
RM. It would be nice to be able to run a `yarn rmadmin` command from a remote 
host to tell the RM to graceful decom a node. AFAIK that functionality doesn't 
exist. 

I still don't like the idea of completely undermining {{-updateNodeResource}}. 
I think I would be more on board with a feature that is disabled by default, 
but can be enabled. That way we won't break any existing ways of doing things, 
but will give more flexibility to those who want to detect these types of 
failures. They will just have to understand that it isn't compatible with 
{{-updateNodeResource}}
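
To make the "disabled by default" idea concrete, here is a sketch of how the 
heartbeat path could be gated (the property name and the capability-update 
call are made up for illustration; this is not an existing YARN config):

{noformat}
// Hypothetical flag, false by default, so existing -updateNodeResource
// behavior is unchanged unless an operator explicitly opts in.
boolean autoUpdate = conf.getBoolean(
    "yarn.resourcemanager.auto-update-node-resources.enabled", false);

if (autoUpdate) {
  // Only then would the RM overwrite its view of the node's capability
  // with whatever the NM reported in this heartbeat (method illustrative).
  updateNodeCapability(rmNode, heartbeatReportedCapability);
}
{noformat}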

> Nodemanagers cannot detect GPU failures
> ---
>
> Key: YARN-10616
> URL: https://issues.apache.org/jira/browse/YARN-10616
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node who 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in this case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that would give potentially unexpected behavior in relation to nodes 
> properly auto-detecting failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. Would like to hear thoughts of others on how best to approach this
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable

2021-03-18 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304333#comment-17304333
 ] 

Eric Badger commented on YARN-10495:


I would suggest using a dockerfile with the same OS version as what you plan to 
run on

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In  https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib env, but our nodemanager machines don't 
> have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path 
> /usr/lib/x86_64-linux-gnu
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r--  1 root root55852 Feb  7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 Sep 28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 Sep 28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 Dec 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 Sep 28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 Feb  7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 Feb  7  2019 libc.so
> {code}
>  We build container-executor on the jenkins machine, so the libcrypto.so 
> versions do not match, which causes an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make the RPATH of container-executor configurable to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.

2021-03-18 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10703:
---
Fix Version/s: 3.3.1

I've also committed this to branch-3.3. This has now been committed to trunk 
(3.4) and branch-3.3

> Fix potential null pointer error of gpuNodeResourceUpdateHandler in 
> NodeResourceMonitorImpl.
> 
>
> Key: YARN-10703
> URL: https://issues.apache.org/jira/browse/YARN-10703
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10703.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-18 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10692:
---
Fix Version/s: 3.3.1

I cherry-picked this to branch-3.3. I would like all of the GPU stuff to go back 
to 3.3 if the cherry-picks are clean. 

This has now been committed to trunk (3.4) and branch-3.3

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch, 
> YARN-10692.003.patch
>
>
> Currently there is no node-level GPU utilization metric; this issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.

2021-03-18 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304313#comment-17304313
 ] 

Eric Badger commented on YARN-10703:


+1 I've committed this to trunk (3.4)

> Fix potential null pointer error of gpuNodeResourceUpdateHandler in 
> NodeResourceMonitorImpl.
> 
>
> Key: YARN-10703
> URL: https://issues.apache.org/jira/browse/YARN-10703
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10703.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10703) Fix potential null pointer error of gpuNodeResourceUpdateHandler in NodeResourceMonitorImpl.

2021-03-18 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10703:
---
Fix Version/s: 3.4.0

> Fix potential null pointer error of gpuNodeResourceUpdateHandler in 
> NodeResourceMonitorImpl.
> 
>
> Key: YARN-10703
> URL: https://issues.apache.org/jira/browse/YARN-10703
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10703.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.

2021-03-17 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10688:
---
Fix Version/s: 3.2.3
   3.3.1
   3.4.0

Thanks for the updated patch, [~zhuqi]! +1

I've committed this to trunk (3.4), branch-3.3, and branch-3.2. There was a 
small import conflict that I took care of in the cherry-pick to branch-3.2

> ClusterMetrics should support GPU capacity related metrics.
> ---
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10688.001.patch, YARN-10688.002.patch, 
> YARN-10688.003.patch, YARN-10688.004.patch, image-2021-03-11-15-35-49-625.png
>
>
> Currently, ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302956#comment-17302956
 ] 

Eric Badger commented on YARN-10503:


One initial question I have is whether we should generalize this to any 
resource type (e.g. GPU, FPGA, etc). GPU already isn't a first-class resource 
in YARN. If we aren't going to make it one, then I think it would be prudent to 
make these additions generalized to all arbitrary resources instead of just GPUs
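
As a sketch of what "generalized" could look like: iterate every registered 
resource type instead of hardcoding an enum (this uses the real 
{{ResourceUtils.getResourceTypes()}} helper; the loop body is illustrative):

{noformat}
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ResourceInformation;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

// Walk everything the cluster has registered (GPU, FPGA, or any custom
// countable resource) rather than enumerating MEMORY/VCORES/GPU by hand.
for (Map.Entry<String, ResourceInformation> entry :
    ResourceUtils.getResourceTypes().entrySet()) {
  String resourceName = entry.getKey();  // e.g. "yarn.io/gpu"
  // ... parse/apply the absolute capacity configured for resourceName ...
}
{noformat}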

> Support queue capacity in terms of absolute resources with gpu resourceType.
> 
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands on 
> different resourceTypes.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302931#comment-17302931
 ] 

Eric Badger commented on YARN-10692:


[~zhuqi], it looks like the unit test failure from Hadoop QA is related to the 
patch. Additionally, there are no unit tests added for the patch. I think it 
would be good to add one to TestNodeManagerMetrics.
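
Something along these lines could work there (a sketch only; the GPU accessor 
names are guesses at what this patch adds and should be adjusted to match):

{noformat}
// TestNodeManagerMetrics-style check; setNodeGpuUtilization and
// getNodeGpuUtilization are hypothetical names for the new gauge.
NodeManagerMetrics metrics = NodeManagerMetrics.create();
metrics.setNodeGpuUtilization(0.5f);
assertEquals(0.5f, metrics.getNodeGpuUtilization(), 1e-6f);
{noformat}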

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch
>
>
> Currently there is no node-level GPU utilization metric; this issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302893#comment-17302893
 ] 

Eric Badger commented on YARN-10688:


{noformat}
   @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
   @Metric("Memory Capability") MutableGaugeLong capabilityMB;
   @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
+  @Metric("GPU Capability")
+  private MutableGaugeLong capabilityGPUs;
{noformat}
To maintain consistency, I would actually remove the private here and let the 
checkstyle warning exist. I would prefer to update the checkstyle for them all 
in a separate JIRA. But I think consistency is most important. Other than that, 
the patch looks good to me

> ClusterMetrics should support GPU capacity related metrics.
> ---
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, YARN-10688.002.patch, 
> YARN-10688.003.patch, image-2021-03-11-15-35-49-625.png
>
>
> Currently, ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302864#comment-17302864
 ] 

Eric Badger commented on YARN-10616:


bq. For the "updateNodeResource" issue, one question is that is it a frequently 
used operation? I'm not aware of the scenario that we use this often.
[~ztang], we use this feature internally. Maybe once or twice a day across all 
of our clusters. Usually to quickly remove a node from a cluster while we 
investigate why it's running slow or causing errors. We will use 
{{updateNodeResource}} to set the node resources to 0, meaning that nothing 
will get scheduled on the node. But the NM will still be running so that we can 
jstack or grab a heap dump. For us at least, the only time we ever use this 
operation is to remove a node from the cluster. So maybe there's a different 
way that we could do that such that it doesn't mess with the node resources. 
Because this really is just a simple hack to get the node to not schedule 
anything else.
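
For reference, the same zero-out can be expressed against the admin protocol (a 
sketch; {{adminProtocol}} is an assumed handle to 
{{ResourceManagerAdministrationProtocol}} and the node ID is made up):

{noformat}
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceOption;
import org.apache.hadoop.yarn.server.api.protocolrecords.UpdateNodeResourceRequest;

// Zero the node's resources so the scheduler places nothing new on it,
// while the NM keeps running for debugging (jstack, heap dumps, etc.).
NodeId node = NodeId.newInstance("badnode.example.com", 45454);
ResourceOption zero = ResourceOption.newInstance(
    Resource.newInstance(0, 0), -1 /* no overcommit timeout */);
adminProtocol.updateNodeResource(
    UpdateNodeResourceRequest.newInstance(Collections.singletonMap(node, zero)));
{noformat}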

> Nodemanagers cannot detect GPU failures
> ---
>
> Key: YARN-10616
> URL: https://issues.apache.org/jira/browse/YARN-10616
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node who 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in this case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that would give potentially unexpected behavior in relation to nodes 
> properly auto-detecting failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is ok I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. Would like to hear thoughts of others on how best to approach this
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302860#comment-17302860
 ] 

Eric Badger commented on YARN-9618:
---

The patch looks reasonable to me. Agree with [~gandras] that some stress 
testing should be done before committing

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch
>
>
> The current implementation of NodeListManager events blocks the async 
> dispatcher and can cause RM crashes and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event will create 1K 
> events, so overall there could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution :
> # Add another async Event handler similar to scheduler.
> # Instead of adding events to dispatcher directly call RMApp event handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302782#comment-17302782
 ] 

Eric Badger commented on YARN-10501:


[~aajisaka], [~ahussein], most recent builds are failing due to some yetus flag 
errors. Is this a recent change? Do you know how to mitigate it?

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10502-branch-2.10.002.patch, 
> YARN-10502-branch-2.10.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the 
> WILDCARD_PORT (0) port, not all label info can be removed from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info, and when 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable

2021-03-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302761#comment-17302761
 ] 

Eric Badger commented on YARN-10495:


[~angerszhu], I don't think it's a good idea to ship glibc with Hadoop. glibc 
is tied very closely to the kernel and if the ABI has changed then it won't 
work. 

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In  https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib env, but our nodemanager machines don't 
> have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path 
> /usr/lib/x86_64-linux-gnu
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r--  1 root root55852 Feb  7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 Sep 28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 Sep 28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 Dec 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 Sep 28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 Feb  7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 Feb  7  2019 libc.so
> {code}
>  We build container-executor on the jenkins machine, so the libcrypto.so 
> versions do not match, which causes an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make the RPATH of container-executor configurable to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10690) ClusterMetrics should support GPU utilization related metrics.

2021-03-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302009#comment-17302009
 ] 

Eric Badger commented on YARN-10690:


[~zhuqi], can we convert the related JIRAs to be subtasks of this JIRA? That 
will make it easier to track them. 

> ClusterMetrics should support GPU utilization related metrics.
> --
>
> Key: YARN-10690
> URL: https://issues.apache.org/jira/browse/YARN-10690
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301993#comment-17301993
 ] 

Eric Badger commented on YARN-10501:


[~caozhiqiang], it doesn't need to be merged to 2.10.1. It has successfully 
been merged to branch-2.10. Try uploading your patch one more time as 
YARN-10502-branch-2.10.002.patch

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.001.patch, 
> YARN-10501-branch-2.10.1.001.patch, YARN-10501-branch-2.10.1.002.patch, 
> YARN-10501.002.patch, YARN-10501.003.patch, YARN-10501.004.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the 
> WILDCARD_PORT (0) port, not all label info can be removed from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info, and when 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU capacity related metrics.

2021-03-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301987#comment-17301987
 ] 

Eric Badger commented on YARN-10688:


[~zhuqi], thanks for the updated patch. To make things a little cleaner, I 
think we can do something like this instead of having 2 separate methods.

{noformat}
  public long getCapabilityGPUs() {
if (capabilityGPUs == null) {
  return 0;
}

return capabilityGPUs.value();
  }
{noformat}

This works in my non-GPU environment. I think it's cleaner, but I need you to 
test it out in your GPU environment to make sure it works ok. And then of 
course update the unit tests to use {{getCapabilityGPUs}}.
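
For example, on a cluster without GPUs the guarded getter should just fall 
back to 0 (a sketch, using the usual {{ClusterMetrics.getMetrics()}} 
singleton):

{noformat}
// With no yarn.io/gpu resource registered, capabilityGPUs stays null and
// the null check above makes this return 0 instead of throwing.
assertEquals(0, ClusterMetrics.getMetrics().getCapabilityGPUs());
{noformat}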

> ClusterMetrics should support GPU capacity related metrics.
> ---
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, YARN-10688.002.patch, 
> image-2021-03-11-15-35-49-625.png
>
>
> Currently, ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable

2021-03-15 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10495:
---
Fix Version/s: 3.3.1

[~angerszhu], I backported this to branch-3.3. There's a conflict past that. If 
you'd like for it to go further, please provide a patch for branch-3.2

It's now been committed to trunk (3.4) and branch-3.3

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In  https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib env, but our nodemanager machines don't 
> have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path 
> /usr/lib/x86_64-linux-gnu
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager shared lib library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto is):
> {code:java}
> -rw-r--r--  1 root root55852 Feb  7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 Sep 28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 Sep 28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 Dec 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 Sep 28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 Feb  7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 Feb  7  2019 libc.so
> {code}
>  We build container-executor on the jenkins machine, so the libcrypto.so 
> versions do not match, which causes an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make the RPATH of container-executor configurable to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.

2021-03-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299824#comment-17299824
 ] 

Eric Badger commented on YARN-10688:


{noformat}
2021-03-11 19:25:11,183 ERROR [SchedulerEventDispatcher:Event Processor] 
event.EventDispatcher (MarkerIgnoringBase.java:error(159)) - Error in handling 
event type NODE_ADDED to the Event Dispatcher
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: The resource 
manager encountered a problem that should not occur under normal circumstances. 
Please report this error to the Hadoop community by opening a JIRA ticket at 
http://issues.apache.org/jira and including the following information:
* Resource type requested: yarn.io/gpu
* Resource object: 
* The stack trace for this exception: java.lang.Exception
at 
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundException.java:47)
at 
org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:263)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.incrCapability(ClusterMetrics.java:222)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.addNode(ClusterNodeTracker.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addNode(CapacityScheduler.java:2201)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1937)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
at java.lang.Thread.run(Thread.java:748)
{noformat}

This is the error I get when I start up the RM in a cluster without any GPUs

> ClusterMetrics should support GPU related metrics.
> --
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
>
>
> Currently, ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10688) ClusterMetrics should support GPU related metrics.

2021-03-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299814#comment-17299814
 ] 

Eric Badger commented on YARN-10688:


{noformat}
+  Integer gpuIndex = ResourceUtils.getResourceTypeIndex()
+  .get(ResourceInformation.GPU_URI);
+  res.getResourceInformation(ResourceInformation.GPU_URI);
+  if (gpuIndex != null) {
+capabilityGPUs.incr(res.
+getResourceValue(ResourceInformation.GPU_URI));
+  }
{noformat}

{noformat}
+  res.getResourceInformation(ResourceInformation.GPU_URI);
{noformat}
Looks like this line is unnecessary
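
i.e., the block would reduce to the following (same logic, redundant call 
dropped):

{noformat}
Integer gpuIndex = ResourceUtils.getResourceTypeIndex()
    .get(ResourceInformation.GPU_URI);
if (gpuIndex != null) {
  capabilityGPUs.incr(res.getResourceValue(ResourceInformation.GPU_URI));
}
{noformat}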

> ClusterMetrics should support GPU related metrics.
> --
>
> Key: YARN-10688
> URL: https://issues.apache.org/jira/browse/YARN-10688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, resourcemanager
>Affects Versions: 3.2.2, 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10688.001.patch, image-2021-03-11-15-35-49-625.png
>
>
> Currently, ClusterMetrics only supports memory- and vcore-related metrics.
>  
> {code:java}
> @Metric("Memory Utilization") MutableGaugeLong utilizedMB;
> @Metric("Vcore Utilization") MutableGaugeLong utilizedVirtualCores;
> @Metric("Memory Capability") MutableGaugeLong capabilityMB;
> @Metric("Vcore Capability") MutableGaugeLong capabilityVirtualCores;
> {code}
>  
>  
> !image-2021-03-11-15-35-49-625.png|width=593,height=253!
> In our cluster, we added GPU support, so I think GPU-related metrics 
> should also be supported by ClusterMetrics.
>  
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299772#comment-17299772
 ] 

Eric Badger commented on YARN-10501:


[~aajisaka], looks like the precommit is still failing to install jdk 7

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch, 
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the 
> WILDCARD_PORT (0) port, not all label info can be removed from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info 
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port,
> both the 0 port and the real NM port are added to the node info, and when
> labels are removed, the node.labels parameter at line 647 is null, so the
> old label is not removed.
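
To make the failure mode concrete, here is a minimal sketch of one possible 
fix, not the attached patch: snapshot the host's old labels before they are 
cleared, and fall back to that snapshot when node.labels is null. It reuses 
the names from the snippet above; everything beyond that is an assumption.

{code:java}
case REPLACE:
  // Sketch of a possible fix (not the committed patch): remember the old
  // labels before host.labels is cleared, so nodes whose labels field was
  // nulled by an earlier REPLACE can still have their labels removed.
  Set<String> oldHostLabels = new HashSet<>(host.labels);
  replaceNodeForLabels(nodeId, host.labels, labels);
  replaceLabelsForNode(nodeId, host.labels, labels);
  host.labels.clear();
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
    // node.labels can be null here (the reported bug), so fall back to the
    // snapshot instead of passing null as the set of old labels.
    Set<String> oldLabels = node.labels != null ? node.labels : oldHostLabels;
    replaceNodeForLabels(node.nodeId, oldLabels, labels);
    node.labels = null;
  }
  break;
{code}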



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-09 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298253#comment-17298253
 ] 

Eric Badger commented on YARN-10501:


[~ahussein], [~aajisaka], is this due to any of the recent yetus changes? New 
branch-2.10 patches are failing Hadoop QA because it can't find openjdk-7-jdk

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch, 
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch
>
>
> When a label is added to nodes without a nodemanager port, or with the
> WILDCARD_PORT (0) port, not all of the label info on these nodes can be
> removed.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port,
> both the 0 port and the real NM port are added to the node info, and when
> labels are removed, the node.labels parameter at line 647 is null, so the
> old label is not removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-08 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297684#comment-17297684
 ] 

Eric Badger commented on YARN-10501:


Reopening and submitting patch so that Hadoop QA will run

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch, 
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10501.005.patch
>
>
> When a label is added to nodes without a nodemanager port, or with the
> WILDCARD_PORT (0) port, not all of the label info on these nodes can be
> removed.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port,
> both the 0 port and the real NM port are added to the node info, and when
> labels are removed, the node.labels parameter at line 647 is null, so the
> old label is not removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-03-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reopened YARN-10501:


> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10501-branch-2.10.1.001.patch, 
> YARN-10501-branch-2.10.1.002.patch, YARN-10501.002.patch, 
> YARN-10501.003.patch, YARN-10501.004.patch, YARN-10501.005.patch
>
>
> When a label is added to nodes without a nodemanager port, or with the
> WILDCARD_PORT (0) port, not all of the label info on these nodes can be
> removed.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4 removes the nodemanager labels, the label info
> is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port,
> both the 0 port and the real NM port are added to the node info, and when
> labels are removed, the node.labels parameter at line 647 is null, so the
> old label is not removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10664:
---
Fix Version/s: 3.2.3

Thanks for the patch, [~Jim_Brennan]! +1 from me. The checkstyle warning should
be cleaned up in a different way than in this patch, and I don't think it is a
big deal here. I've committed this to branch-3.2.

Now this has been committed to trunk (3.4), branch-3.3, and branch-3.2.

> Allow parameter expansion in NM_ADMIN_USER_ENV
> --
>
> Key: YARN-10664
> URL: https://issues.apache.org/jira/browse/YARN-10664
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: YARN-10664-branch-3.2.004.patch, YARN-10664.001.patch, 
> YARN-10664.002.patch, YARN-10664.003.patch, YARN-10664.004.patch
>
>
> Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter
> expansion.  That is, you cannot specify an environment variable such as
> {{JAVA_HOME}} and have it be expanded to {{$JAVA_HOME}} inside the
> container.
> We need this to specify different Java GC options for Java processes
> running inside YARN containers, based on which version of Java is being
> used.
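
For illustration only, a hypothetical yarn-site.xml entry showing the kind of 
expansion this improvement enables. It assumes {{NM_ADMIN_USER_ENV}} resolves 
to the {{yarn.nodemanager.admin-env}} key and borrows the double-brace 
expansion syntax YARN uses for container launch environments; the variable 
name and value are made up.

{code:xml}
<!-- Hypothetical example: {{JAVA_HOME}} should expand to $JAVA_HOME inside
     the container, so GC options can follow the JDK actually in use.
     The variable name and path are made up for illustration. -->
<property>
  <name>yarn.nodemanager.admin-env</name>
  <value>JAVA_GC_OPTS={{JAVA_HOME}}/conf/gc.options</value>
</property>
{code}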



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


