[jira] [Assigned] (YARN-11681) Update the cgroup documentation with v2 support
[ https://issues.apache.org/jira/browse/YARN-11681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11681: Assignee: Benjamin Teke > Update the cgroup documentation with v2 support > --- > > Key: YARN-11681 > URL: https://issues.apache.org/jira/browse/YARN-11681 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > Update the related > [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html] > with v2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11692) Support mixed cgroup v1/v2 controller structure
[ https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11692. -- Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Support mixed cgroup v1/v2 controller structure > --- > > Key: YARN-11692 > URL: https://issues.apache.org/jira/browse/YARN-11692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > There were heavy changes on the device side in cgroup v2. To keep supporting > FPGAs and GPUs in the short term, mixed structures where some of the cgroup > controllers are from v1 while others are from v2 should be supported. More info: > https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
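The mixed-controller idea above can be sketched as a small helper that decides which cgroup version serves each controller, based on /proc/mounts-style content. This is an illustrative Python sketch under stated assumptions (the function name and controller list are hypothetical, not YARN's actual implementation):

```python
def controller_versions(mounts_text):
    """Map each cgroup controller to the version ('v1' or 'v2') it is
    mounted under, given the text of /proc/mounts or /etc/mtab.

    cgroup v1 mounts one filesystem per controller (the controller names
    appear among the mount options); cgroup v2 mounts a single unified
    'cgroup2' filesystem for all controllers."""
    known = ("cpu", "cpuacct", "memory", "io", "blkio", "devices")
    versions = {}
    v2_mounted = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        fstype, options = fields[2], fields[3].split(",")
        if fstype == "cgroup":
            # v1: the mount options name the controllers this mount serves.
            for ctrl in known:
                if ctrl in options:
                    versions[ctrl] = "v1"
        elif fstype == "cgroup2":
            v2_mounted = True
    # In a mixed setup, controllers not claimed by a v1 mount are
    # reachable through the unified v2 hierarchy.
    if v2_mounted:
        for ctrl in ("cpu", "memory", "io", "devices"):
            versions.setdefault(ctrl, "v2")
    return versions
```

For the scenario in this issue, a mounts table with a v1 `devices` mount plus a unified `cgroup2` mount would map `devices` to v1 and the remaining controllers to v2.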
[jira] [Reopened] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11669: -- > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.5.0 > > > cgroup v2 has some fundamental changes compared to v1. RHEL9 and Ubuntu 22 > have already moved to cgroup v2 as the default, hence YARN should support it. This > umbrella tracks the required work. > [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] > A way to test the newly added features: > # Turn on cgroup v1 based on the current > [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. > # System prerequisites: > ## the file {{/etc/mtab}} should contain a mount path with the file system > type {{cgroup2}}; by default this is {{/sys/fs/cgroup}} on most OSes > ## the {{cgroup.subtree_control}} file should contain the necessary > controllers (update it with: {{echo "+cpu +io +memory" > > cgroup.subtree_control}}) > ## either create the YARN hierarchy and give recursive access to the user > running the NM on the node. The hierarchy is {{hadoop-yarn}} by default > (controlled by > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and > recursive mode is required, because as soon as the directory is created it > will be filled with the controller files which YARN will try to edit. > ### Alternatively, if the NM process user has access rights on the > {{/sys/fs/cgroup}} directory, it'll try to create the hierarchy and update the > {{cgroup.subtree_control}} file. 
> # YARN configuration > ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should > point to the directory where the cgroup2 structure is mounted and the > {{hadoop-yarn}} hierarchy was created > ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be > set to {{true}} > ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} > # Launch the NM and monitor the cgroup files on container launches (e.g. > {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}})
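The system prerequisites in step 2 above can be sketched as a small validation helper. This is an illustrative Python sketch, not part of YARN: the function name is hypothetical, and passing the file contents in as text (rather than reading /etc/mtab and cgroup.subtree_control directly) is an assumption made to keep the example self-contained:

```python
def check_cgroup_v2_prereqs(mtab_text, subtree_control_text,
                            required=("cpu", "io", "memory")):
    """Return a list of problems with the cgroup v2 prerequisites.

    mtab_text: contents of /etc/mtab; there must be a mount entry with
    filesystem type 'cgroup2' (typically mounted at /sys/fs/cgroup).
    subtree_control_text: contents of cgroup.subtree_control at the
    mount root; it must list every controller YARN needs."""
    problems = []
    mount_path = None
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "cgroup2":
            mount_path = fields[1]  # e.g. /sys/fs/cgroup
            break
    if mount_path is None:
        problems.append("no cgroup2 filesystem listed in /etc/mtab")
    enabled = set(subtree_control_text.split())
    missing = [c for c in required if c not in enabled]
    if missing:
        problems.append(
            "cgroup.subtree_control is missing controllers: "
            + " ".join(missing)
            + ' (enable with: echo "+cpu +io +memory" > cgroup.subtree_control)')
    return problems
```

An empty result means both prerequisites hold; each returned string describes one failed check, matching the two system prerequisites listed in the test steps.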
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Fix Version/s: (was: 3.5.0) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Resolved] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11669. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.5.0
[jira] [Updated] (YARN-11689) Update the cgroup v2 init error handling to provide more straightforward error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11689: - Description: The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. The cgroup v2 init should be more stable and it should be updated to show the exact step where it failed. (was: The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. It would be useful to show the underlying exception and its message as well, by default.) > Update the cgroup v2 init error handling to provide more straightforward > error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. The cgroup v2 init should be more stable and it should be > updated to show the exact step where it failed.
[jira] [Updated] (YARN-11689) Update the cgroup v2 init error handling to provide more straightforward error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11689: - Summary: Update the cgroup v2 init error handling to provide more straightforward error messages (was: Update getErrorWithDetails method to provide more meaningful error messages) > Update the cgroup v2 init error handling to provide more straightforward > error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. It would be useful to show the underlying exception and its > message as well, by default.
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and give recursive access to the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and recursive mode is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R user:group /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R user:group /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R yarn:hadoop /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R yarn:hadoop /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11692) Support mixed cgroup v1/v2 controller structure
[ https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11692: - Description: There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported. More info: https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/ (was: There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported. ) > Support mixed cgroup v1/v2 controller structure > --- > > Key: YARN-11692 > URL: https://issues.apache.org/jira/browse/YARN-11692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > There were heavy changes on the device side in cgroup v2. To keep supporting > FPGAs and GPUs in the short term, mixed structures where some of the cgroup > controllers are from v1 while others are from v2 should be supported. More info: > https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
[jira] [Created] (YARN-11692) Support mixed cgroup v1/v2 controller structure
Benjamin Teke created YARN-11692: Summary: Support mixed cgroup v1/v2 controller structure Key: YARN-11692 URL: https://issues.apache.org/jira/browse/YARN-11692 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported.
[jira] [Created] (YARN-11690) Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios
Benjamin Teke created YARN-11690: Summary: Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios Key: YARN-11690 URL: https://issues.apache.org/jira/browse/YARN-11690 Project: Hadoop YARN Issue Type: Sub-task Components: container-executor Reporter: Benjamin Teke Assignee: Benjamin Teke The container executor function {{write_pid_to_cgroup_as_root}} writes the PID of the newly launched container to the correct cgroup.procs file. However, it checks whether the file is mounted on a cgroup filesystem, and does that check using the filesystem magic number, which differs between v1 and v2. The check should handle both v1 and v2 filesystems.
{code:java}
/**
 * Write the pid of the current process to the cgroup file.
 * cgroup_file: Path to cgroup file where pid needs to be written to.
 */
static int write_pid_to_cgroup_as_root(const char* cgroup_file, pid_t pid) {
  int rc = 0;
  uid_t user = geteuid();
  gid_t group = getegid();
  if (change_effective_user(0, 0) != 0) {
    rc = -1;
    goto cleanup;
  }

  // statfs
  struct statfs buf;
  if (statfs(cgroup_file, &buf) == -1) {
    fprintf(LOGFILE, "Can't statfs file %s as node manager - %s\n",
      cgroup_file, strerror(errno));
    rc = -1;
    goto cleanup;
  } else if (buf.f_type != CGROUP_SUPER_MAGIC) {
    fprintf(LOGFILE, "Pid file %s is not located on cgroup filesystem\n",
      cgroup_file);
    rc = -1;
    goto cleanup;
  }
{code}
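The fix implied here amounts to accepting either filesystem magic number in the check. A minimal sketch of the relaxed condition (Python for illustration; the function name is hypothetical, and the constants mirror the values defined in the kernel's include/uapi/linux/magic.h):

```python
# Filesystem magic numbers from the Linux kernel's include/uapi/linux/magic.h.
CGROUP_SUPER_MAGIC = 0x27E0EB    # cgroup v1 filesystem
CGROUP2_SUPER_MAGIC = 0x63677270  # cgroup v2 (unified) filesystem

def is_cgroup_filesystem(f_type):
    """Accept a statfs() f_type for either cgroup v1 or v2, mirroring the
    relaxed check write_pid_to_cgroup_as_root needs instead of comparing
    against CGROUP_SUPER_MAGIC alone."""
    return f_type in (CGROUP_SUPER_MAGIC, CGROUP2_SUPER_MAGIC)
```

In the C code above, the equivalent change would test `buf.f_type` against both constants rather than only `CGROUP_SUPER_MAGIC`.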
[jira] [Created] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages
Benjamin Teke created YARN-11689: Summary: Update getErrorWithDetails method to provide more meaningful error messages Key: YARN-11689 URL: https://issues.apache.org/jira/browse/YARN-11689 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. It would be useful to show the underlying exception and its message as well, by default.
[jira] [Assigned] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11689: Assignee: Benjamin Teke > Update getErrorWithDetails method to provide more meaningful error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. It would be useful to show the underlying exception and its > message as well, by default.
[jira] [Assigned] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11191: Assignee: Tamas Domok > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug that may impact all clusters with preemption enabled. In our > current version with preemption enabled, the CapacityScheduler calls the > refreshQueue method of the PreemptionManager when it refreshes queues. This > process holds the PreemptionManager write lock and requires the csqueue read > lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock > and requires the PreemptionManager read lock. > There is a possibility of deadlock here, because read locks follow one rule > under the non-fair policy: when the lock is already held by a read lock and the > first request in the lock's wait queue is a write-lock request, other > read-lock requests cannot acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PreemptionManager.readLock > other thread(completeContainer, release resource, etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2
Benjamin Teke created YARN-11687: Summary: Update CGroupsResourceCalculator to track usages using cgroupv2 Key: YARN-11687 URL: https://issues.apache.org/jira/browse/YARN-11687 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke [CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java] should also be updated to handle the cgroup v2 changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11679) Update GpuResourceHandler for cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838163#comment-17838163 ] Benjamin Teke commented on YARN-11679: -- GPU support is a bit tricky: cgroup v2 has no interface files for the device controller; it's now implemented on top of BPF. From the [docs|https://docs.kernel.org/admin-guide/cgroup-v2.html]: {quote}Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a device file, corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM. A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx structure, which describes the device access attempt: access type (mknod/read/write) and device (type, major and minor numbers). If the program returns 0, the attempt fails with -EPERM, otherwise it succeeds. An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree. {quote} > Update GpuResourceHandler for cgroup v2 support > --- > > Key: YARN-11679 > URL: https://issues.apache.org/jira/browse/YARN-11679 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > cgroup v2 has some changes in various controllers (some changed their > functionality, some were removed). This task is about checking if > GpuResourceHandler's > [implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45] > needs any updates. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11685) Create a config to enable/disable cgroup v2 functionality
Benjamin Teke created YARN-11685: Summary: Create a config to enable/disable cgroup v2 functionality Key: YARN-11685 URL: https://issues.apache.org/jira/browse/YARN-11685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Various OSes mount cgroup v2 differently: some mount both the v1 and v2 structures, others mount a hybrid structure. To avoid initialization issues, the cgroup v1/v2 functionality should be selected by a config property. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
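If such a switch is added, it would presumably live in yarn-site.xml alongside the other NodeManager cgroup settings. The property name below is purely hypothetical — the issue only proposes that a toggle should exist, not what it is called:

```xml
<!-- Hypothetical toggle; the actual property name is not decided in this issue. -->
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.v2.enabled</name>
  <value>true</value>
</property>
```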
[jira] [Created] (YARN-11681) Update the cgroup documentation with v2 support
Benjamin Teke created YARN-11681: Summary: Update the cgroup documentation with v2 support Key: YARN-11681 URL: https://issues.apache.org/jira/browse/YARN-11681 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Update the related [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html] with v2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11680) Update FpgaResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11680: Summary: Update FpgaResourceHandler for cgroup v2 support Key: YARN-11680 URL: https://issues.apache.org/jira/browse/YARN-11680 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if FpgaResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/fpga/FpgaResourceHandlerImpl.java#L55] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11679) Update GpuResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11679: Summary: Update GpuResourceHandler for cgroup v2 support Key: YARN-11679 URL: https://issues.apache.org/jira/browse/YARN-11679 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if GpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11678) Update CGroupElasticMemoryController for cgroup v2 support
Benjamin Teke created YARN-11678: Summary: Update CGroupElasticMemoryController for cgroup v2 support Key: YARN-11678 URL: https://issues.apache.org/jira/browse/YARN-11678 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupElasticMemoryController's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupElasticMemoryController.java#L58] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11677) Update OutboundBandwidthResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11677: Summary: Update OutboundBandwidthResourceHandler implementation for cgroup v2 support Key: YARN-11677 URL: https://issues.apache.org/jira/browse/YARN-11677 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if OutboundBandwidthResourceHandler's [implementation|https://github.com/apache/hadoop/blob/2064ca015d1584263aac0cc20c60b925a3aff612/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficControlBandwidthHandlerImpl.java#L43] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11676) Update CGroupsBlkioResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11676: Summary: Update CGroupsBlkioResourceHandler implementation for cgroup v2 support Key: YARN-11676 URL: https://issues.apache.org/jira/browse/YARN-11676 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupsBlkioResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsBlkioResourceHandlerImpl.java#L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11675: Summary: Update MemoryResourceHandler implementation for cgroup v2 support Key: YARN-11675 URL: https://issues.apache.org/jira/browse/YARN-11675 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if MemoryResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11674) Update CpuResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11674: Summary: Update CpuResourceHandler implementation for cgroup v2 support Key: YARN-11674 URL: https://issues.apache.org/jira/browse/YARN-11674 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsCpuResourceHandlerImpl.java#L60] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11673) Extend the cgroup mount functionality to mount the v2 structure
Benjamin Teke created YARN-11673: Summary: Extend the cgroup mount functionality to mount the v2 structure Key: YARN-11673 URL: https://issues.apache.org/jira/browse/YARN-11673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke YARN has a --mount-cgroup operation in the [container-executor|https://github.com/apache/hadoop/blob/9c7b8cf54ea88833d54fc71a9612c448dc0eb78d/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L2929] which mounts each controller's cgroup folder to a specified path. In cgroup v2 the controller structure changed: it is now flat, so there are no separate per-controller paths. To remain compatible with v1, a new mount method should be added, but its functionality can be simplified considerably for v2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11672) Create a CgroupHandler implementation for cgroup v2
Benjamin Teke created YARN-11672: Summary: Create a CgroupHandler implementation for cgroup v2 Key: YARN-11672 URL: https://issues.apache.org/jira/browse/YARN-11672 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Assignee: Benjamin Teke [CGroupsHandler's|https://github.com/apache/hadoop/blob/69b328943edf2f61c8fc139934420e3f10bf3813/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandler.java#L36] current implementation contains the functionality to mount and set up the YARN-specific cgroup v1 structure. A similar v2 implementation should be created that allows initialising the v2 structure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Summary: [Umbrella] cgroup v2 support (was: cgroups v2 support for YARN) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > > cgroup v2 is becoming the default for OSes such as RHEL9. > Support in YARN has to be implemented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] was: The cgroups v2 is becoming the default for OSs, like RHEL9. Support for YARN has to be implemented. > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > > cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 > already moved to cgroup v2 as a default, hence YARN should support it. This > umbrella tracks the required work. > [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III
[ https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-5305. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Yarn Application Log Aggregation fails due to NM can not get correct HDFS > delegation token III > -- > > Key: YARN-5305 > URL: https://issues.apache.org/jira/browse/YARN-5305 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Xianyin Xin >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Different from YARN-5098 and YARN-5302, this problem happens when the AM submits > a startContainer request with a new HDFS token (say, tokenB) which is not > managed by YARN, so two tokens exist in the credentials of the user on the NM: > one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is > selected when connecting to HDFS and tokenB expires, an exception occurs. > Supplementary: this problem happens because the AM didn't use the service name > as the token alias in the credentials, so two tokens for the same service can > co-exist in one credentials object. The TokenSelector can only select the first matching > token; it doesn't check whether the token is valid. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10889. -- Fix Version/s: 3.4.0 Target Version/s: 3.4.0 Resolution: Fixed > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10889: - Fix Version/s: 3.5.0 > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11041) Replace all occurences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-10889: Assignee: Benjamin Teke (was: Szilard Nemeth) > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10921) AbstractCSQueue: Node Labels logic is scattered and iteration logic is repeated all over the place
[ https://issues.apache.org/jira/browse/YARN-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10921: - Parent Issue: YARN-11652 (was: YARN-10889) > AbstractCSQueue: Node Labels logic is scattered and iteration logic is > repeated all over the place > -- > > Key: YARN-10921 > URL: https://issues.apache.org/jira/browse/YARN-10921 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > > TODO items: > - Check original Node labels epic / jiras? > - Think about ways to improve repetitive iteration on configuredNodeLabels > - Search for: "String label" in code > Code blocks to handle Node labels: > - AbstractCSQueue#setupQueueConfigs > - AbstractCSQueue#getQueueConfigurations > - AbstractCSQueue#accessibleToPartition > - AbstractCSQueue#getNodeLabelsForQueue > - AbstractCSQueue#updateAbsoluteCapacities > - AbstractCSQueue#updateConfigurableResourceRequirement > - CSQueueUtils#loadCapacitiesByLabelsFromConf > - AutoCreatedLeafQueue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10888. -- Resolution: Fixed > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if every place where a capacity can be defined accepted the > same formats: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (e.g. 10% of all memory, vcore, GPU) > ** Percentage per resource type (e.g. 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see whether it is possible (and sensible) to extend the > configuration in each case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11652) [Umbrella] Follow-up after YARN-10888/YARN-10889
Benjamin Teke created YARN-11652: Summary: [Umbrella] Follow-up after YARN-10888/YARN-10889 Key: YARN-11652 URL: https://issues.apache.org/jira/browse/YARN-11652 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.5.0 Reporter: Benjamin Teke Assignee: Benjamin Teke Follow-up improvements after the changes in YARN-10888/YARN-10889. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10886) Cluster based and parent based max capacity in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10886: - Parent Issue: YARN-11652 (was: YARN-10888) > Cluster based and parent based max capacity in Capacity Scheduler > - > > Key: YARN-10886 > URL: https://issues.apache.org/jira/browse/YARN-10886 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Major > > We want to introduce the percentage modes relative to the cluster, not the > parent, i.e > The property root.users.maximum-capacity will mean one of the following > things: > *Either Parent Percentage:* maximum capacity relative to its parent. If it’s > set to 50, then it means that the capacity is capped with respect to the > parent. This can be covered by the current format, no change there. > *Or Cluster Percentage:* maximum capacity expressed as a percentage of the > overall cluster capacity. This case is the new scenario, for example: > yarn.scheduler.capacity.root.users.max-capacity = c:50% > yarn.scheduler.capacity.root.users.max-capacity = c:50%, c:30% -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-10888: Assignee: Benjamin Teke (was: Szilard Nemeth) > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if every place where a capacity can be defined accepted the > same formats: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (e.g. 10% of all memory, vcore, GPU) > ** Percentage per resource type (e.g. 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see whether it is possible (and sensible) to extend the > configuration in each case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11041: -- Reopening because of a compilation failure in the original PR. > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message was sent by
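The `createChild` factory method suggested in the review thread above can be sketched roughly as follows. This is a simplified stand-in to illustrate the idea, not the actual Hadoop `QueuePath` class, whose real API may differ:

```java
// Simplified stand-in for the QueuePath idea discussed above: a path knows
// its full dotted string form, and createChild derives a child path from it.
public class QueuePath {
    private final String path; // e.g. "root" or "root.users"

    public QueuePath(String path) {
        this.path = path;
    }

    // Factory method from the review discussion: returns a new QueuePath
    // whose full path is <this>.<childName>, with this path as the parent.
    public QueuePath createChild(String childName) {
        return new QueuePath(path + "." + childName);
    }

    public String getFullPath() {
        return path;
    }

    // Parent is everything before the last dot; null for the root path.
    public String getParent() {
        int idx = path.lastIndexOf('.');
        return idx < 0 ? null : path.substring(0, idx);
    }
}
```

With such a factory, callers build `root.childName` paths without repeating the string concatenation ("string magic") that the reviewers objected to.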
[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10888: - Fix Version/s: 3.4.0 > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if everywhere where a capacity can be defined could be > defined the same way: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (eg 10% of all memory, vcore, GPU) > ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in that case.
[jira] [Resolved] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Fix Version/s: 3.5.0 (was: 3.4.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
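The flakiness described above comes from asserting on JSON responses in which the order of child queues is not deterministic. One general way to make such assertions stable — shown here as a stdlib-only sketch, not the fix actually applied in the Hadoop tests — is to normalize the queue order before comparing:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class QueueOrderNormalizer {
    // Sorts a list of queue objects (e.g. parsed from a scheduler JSON
    // response) by a hypothetical "queueName" field, so equality assertions
    // no longer depend on the order in which queues were emitted.
    public static List<Map<String, String>> sortByQueueName(List<Map<String, String>> queues) {
        List<Map<String, String>> copy = new ArrayList<>(queues);
        copy.sort(Comparator.comparing(q -> q.get("queueName")));
        return copy;
    }
}
```

Normalizing both the expected and the actual structure this way (or using an order-lenient JSON comparison mode) removes the dependency on iteration order that makes such tests flaky.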
[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11639: - Affects Version/s: 3.5.0 (was: 3.4.0) > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.4, 3.3.6, 3.5.0 >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at >
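The `ConcurrentModificationException` above is the classic symptom of the stream's source collection being structurally modified — here, a dynamic queue being removed by the deletion policy — while `getAssignmentIterator` is still iterating over it. A minimal stdlib-only reproduction of that failure mode, unrelated to the scheduler code itself:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

public class CmeDemo {
    // Returns true if iterating a list with a stream while removing one of
    // its elements (as a concurrent queue deletion would) throws
    // ConcurrentModificationException from the fail-fast ArrayList spliterator.
    public static boolean mutationDuringStreamThrows() {
        List<String> queues = new ArrayList<>(List.of("root.a", "root.b", "root.c"));
        try {
            queues.stream()
                  .map(q -> {
                      if ("root.a".equals(q)) {
                          queues.remove("root.c"); // structural modification mid-iteration
                      }
                      return q;
                  })
                  .collect(Collectors.toList());
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // the spliterator detects the modCount change
        }
    }
}
```

The usual remedies are to snapshot the child-queue list under the appropriate lock before streaming over it, or to otherwise guarantee that sorting and deletion cannot interleave.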
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Affects Version/s: 3.5.0 (was: 3.4.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.5.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
[jira] [Comment Edited] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810368#comment-17810368 ] Benjamin Teke edited comment on YARN-11639 at 1/24/24 12:48 PM: [~bender] Thanks for checking. No, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. was (Author: bteke): [~bender] no, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at >
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810368#comment-17810368 ] Benjamin Teke commented on YARN-11639: -- [~bender] no, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at
[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11639: - Affects Version/s: 3.3.6 3.2.4 3.4.0 > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6 >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Affects Version/s: 3.4.0 (was: 3.5.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
[jira] [Resolved] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11645. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.5.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
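The queue-order flakiness described in YARN-11645 can generally be removed by asserting against a canonical ordering instead of the emission order. A minimal illustrative sketch in plain Java (the committed test fix may differ; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: normalise queue lists into a canonical (lexicographic)
// order before comparing, so a scheduler that enumerates children in a
// different order no longer fails an order-sensitive assertion.
public class QueueOrderNormalizer {
    static List<String> canonical(List<String> queues) {
        List<String> sorted = new ArrayList<>(queues);
        sorted.sort(String::compareTo); // canonical order: lexicographic
        return sorted;
    }

    public static void main(String[] args) {
        List<String> a = canonical(List.of("root.b", "root.a"));
        List<String> b = canonical(List.of("root.a", "root.b"));
        System.out.println(a.equals(b)); // prints true
    }
}
```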
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809577#comment-17809577 ] Benjamin Teke commented on YARN-11639: -- Thanks [~bender] for the patch. Can you please check if branch-3.3/3.2 backport is needed? > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at
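The race in the YARN-11639 traces above — an async scheduler thread streaming over the child-queue collection while the deletion policy removes a dynamic queue — is the classic setup for a ConcurrentModificationException. One common mitigation is to sort a snapshot of the collection rather than the live list. The sketch below is illustrative plain Java under that assumption (class and field names are hypothetical, not the actual CapacityScheduler code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: copy the child-queue list under a read lock, then
// sort the copy, so a concurrent queue removal cannot invalidate the
// iteration mid-sort.
public class QueueSorter {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> childQueues = new ArrayList<>();

    public void addQueue(String name) {
        lock.writeLock().lock();
        try {
            childQueues.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void removeQueue(String name) {
        lock.writeLock().lock();
        try {
            childQueues.remove(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // The sort never touches the live list, so no
    // ConcurrentModificationException is possible here.
    public List<String> sortedSnapshot() {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(childQueues);
        } finally {
            lock.readLock().unlock();
        }
        snapshot.sort(Comparator.naturalOrder());
        return snapshot;
    }

    public static void main(String[] args) {
        QueueSorter sorter = new QueueSorter();
        sorter.addQueue("root.dyn.b");
        sorter.addQueue("root.dyn.a");
        System.out.println(sorter.sortedSnapshot()); // prints [root.dyn.a, root.dyn.b]
    }
}
```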
[jira] [Resolved] (YARN-11634) Speed-up TestTimelineClient
[ https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11634. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Speed-up TestTimelineClient > --- > > Key: YARN-11634 > URL: https://issues.apache.org/jira/browse/YARN-11634 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > The TimelineConnector.class has a hardcoded 1-minute connection time out, > which makes the TestTimelineClient a long-running test (~15:30 min). > Decreasing the timeout to 10ms will speed up the test run (~56 sec). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11630) Passing admin Java options to container localizers
[ https://issues.apache.org/jira/browse/YARN-11630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11630. -- Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Passing admin Java options to container localizers > -- > > Key: YARN-11630 > URL: https://issues.apache.org/jira/browse/YARN-11630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > Currently we can specify Java options for container localizers in > _"yarn.nodemanager.container-localizer.java.opts"_ parameter. > The aim of this ticket is to create a parameter which we can use to pass > admin options as well. It would work similarly as the admin Java options we > can pass for Mapreduce jobs, first we should pass the admin options to the > container executor, then the user-defined ones. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
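The ordering described in YARN-11630 — admin options passed to the container executor first, then user-defined ones — can be sketched with a trivial merge helper. This is an illustrative stand-in (the helper name and splitting logic are hypothetical, not the actual NodeManager code); placing admin options first means later user flags can override earlier admin defaults where the JVM permits it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: build the localizer JVM argument list with admin
// options first, user options second, mirroring the MapReduce admin-opts
// behaviour described in the issue.
public class LocalizerOpts {
    static List<String> mergeJavaOpts(String adminOpts, String userOpts) {
        List<String> cmd = new ArrayList<>();
        cmd.addAll(Arrays.asList(adminOpts.trim().split("\\s+")));
        cmd.addAll(Arrays.asList(userOpts.trim().split("\\s+")));
        return cmd;
    }

    public static void main(String[] args) {
        // The user's -Xmx appears after the admin default, so it wins.
        System.out.println(mergeJavaOpts("-Xmx256m", "-Xmx512m"));
        // prints [-Xmx256m, -Xmx512m]
    }
}
```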
[jira] [Resolved] (YARN-11621) Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal
[ https://issues.apache.org/jira/browse/YARN-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11621. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal > - > > Key: YARN-11621 > URL: https://issues.apache.org/jira/browse/YARN-11621 > Project: Hadoop YARN > Issue Type: Test > Components: yarn >Affects Versions: 3.3.6 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > This test seems to be flaky as it failed 3 times out of 200 runs based on the > trunk. > This was fixed earlier with YARN-7020, but it seems it didn't cover all the > flakiness. > h3. > {code:java} > Error Message > Application attempt appattempt_1630750910491_0001_01 doesn't exist in > ApplicationMasterService cache. at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at > 
java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894) > Stacktrace > org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: > Application attempt appattempt_1630750910491_0001_01 doesn't exist in > ApplicationMasterService cache. at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894) at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at >
[jira] [Created] (YARN-11608) QueueCapacityVectorInfo NPE when accessible labels config is used
Benjamin Teke created YARN-11608: Summary: QueueCapacityVectorInfo NPE when accessible labels config is used Key: YARN-11608 URL: https://issues.apache.org/jira/browse/YARN-11608 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke YARN-11514 extended the REST API to contain CapacityVectors for each configured node label. There is an edge case, however: during initialization each queue's capacities map will be filled with 0 capacities for the unconfigured, but accessible labels (where there is no configured capacity for the label, however the queue has access to it based on the accessible-node-labels property). A very basic example configuration for this is the following: {code:java} "yarn.scheduler.capacity.root.queues": "a, b" "yarn.scheduler.capacity.root.a.capacity": "50" "yarn.scheduler.capacity.root.a.accessible-node-labels": "root-a-default-label" "yarn.scheduler.capacity.root.a.maximum-capacity": "50" "yarn.scheduler.capacity.root.b.capacity": "50" {code} root.a has access to root-a-default-label, however there is no configured capacity for it. The capacityVectors are parsed based on the configuredCapacity map (created from the "accessible-node-labels..capacity" configs). When the scheduler info is requested the capacityVectors are collected per label, and the labels used for this are the keySet of the capacity map: {code:java} for (String partitionName : capacities.getExistingNodeLabels()) { QueueCapacityVector queueCapacityVector = queue.getConfiguredCapacityVector(partitionName); queueCapacityVectorInfo = queueCapacityVector == null ?
new QueueCapacityVectorInfo(new QueueCapacityVector()) : new QueueCapacityVectorInfo(queue.getConfiguredCapacityVector(partitionName)); {code} {code:java} public Set getExistingNodeLabels() { readLock.lock(); try { return new HashSet(capacitiesMap.keySet()); } finally { readLock.unlock(); } } {code} If the capacitiesMap contains entries that are not "configured", this will result in an NPE, breaking the UI and the REST API: {code:java} INTERNAL_SERVER_ERROR java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.QueueCapacityVectorInfo.(QueueCapacityVectorInfo.java:39) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.QueueCapacitiesInfo.(QueueCapacitiesInfo.java:61) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerLeafQueueInfo.populateQueueCapacities(CapacitySchedulerLeafQueueInfo.java:108) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerQueueInfo.(CapacitySchedulerQueueInfo.java:137) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerLeafQueueInfo.(CapacitySchedulerLeafQueueInfo.java:66) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.getQueues(CapacitySchedulerInfo.java:197) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.(CapacitySchedulerInfo.java:94) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:399) {code} There is no need to create capacityVectors for the unconfigured labels, so a null check should solve this issue on the API side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
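The null check proposed at the end of YARN-11608 can be illustrated with a minimal stand-in (names and the string-valued vector are hypothetical; the real fix touches the YARN webapp DAO classes): an unconfigured-but-accessible label resolves to an empty vector instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the API-side null check: labels that exist in the
// capacities map but have no configured capacity vector fall back to an
// empty vector rather than causing an NPE.
public class CapacityVectorLookup {
    static final String EMPTY_VECTOR = "[]";
    private final Map<String, String> configuredVectors = new HashMap<>();

    void configure(String label, String vector) {
        configuredVectors.put(label, vector);
    }

    String vectorFor(String label) {
        String v = configuredVectors.get(label);
        // Null check: unconfigured labels get the empty vector.
        return v == null ? EMPTY_VECTOR : v;
    }

    public static void main(String[] args) {
        CapacityVectorLookup lookup = new CapacityVectorLookup();
        lookup.configure("default", "[memory-mb=50%,vcores=50%]");
        // Accessible but unconfigured label: safe fallback instead of NPE.
        System.out.println(lookup.vectorFor("root-a-default-label")); // prints []
        System.out.println(lookup.vectorFor("default")); // prints [memory-mb=50%,vcores=50%]
    }
}
```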
[jira] [Updated] (YARN-11584) [CS] Attempting to create Leaf Queue with empty shortname should fail without crashing RM
[ https://issues.apache.org/jira/browse/YARN-11584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11584: - Fix Version/s: 3.4.0 > [CS] Attempting to create Leaf Queue with empty shortname should fail without > crashing RM > - > > Key: YARN-11584 > URL: https://issues.apache.org/jira/browse/YARN-11584 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Brian Goerlitz >Assignee: Brian Goerlitz >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > If an app submission results in attempting to auto-create a leaf queue with > an empty short name, the app submission should be rejected without the RM > crashing. Currently, the queue will be created, but the RM encounters a FATAL > exception due to metrics collision. > For example, if an app is placed to 'root.' the RM will fail with the below. > {noformat} > 2023-09-12 20:23:43,294 FATAL org.apache.hadoop.yarn.event.EventDispatcher: > Error in handling event type APP_ADDED to the Event Dispatcher > org.apache.hadoop.metrics2.MetricsException: Metrics source > QueueMetrics,q0=root already exists! 
> at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics.forQueue(CSQueueMetrics.java:309) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.(AbstractCSQueue.java:147) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue.(AbstractLeafQueue.java:148) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:42) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.createNewQueue(ParentQueue.java:495) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.addDynamicChildQueue(ParentQueue.java:563) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.addDynamicLeafQueue(ParentQueue.java:517) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.createAutoQueue(CapacitySchedulerQueueManager.java:678) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.createQueue(CapacitySchedulerQueueManager.java:511) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getOrCreateQueueFromPlacementContext(CapacityScheduler.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplication(CapacityScheduler.java:962) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1920) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:170) > at > 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
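The fix direction implied by YARN-11584 — reject the submission before any metrics source is registered — amounts to validating the auto-created queue's short name up front. A hedged sketch (illustrative, not the committed patch; class and method names are hypothetical):

```java
// Hypothetical sketch: derive the leaf queue's short name from its full path
// and fail fast with a checked error when it is empty, so placement to a
// path like "root." is rejected cleanly instead of crashing the RM with a
// QueueMetrics collision.
public class QueuePathValidator {
    static String shortName(String fullPath) {
        int idx = fullPath.lastIndexOf('.');
        return idx < 0 ? fullPath : fullPath.substring(idx + 1);
    }

    static void validateLeafQueuePath(String fullPath) {
        if (shortName(fullPath).isEmpty()) {
            throw new IllegalArgumentException(
                "Leaf queue short name must not be empty: '" + fullPath + "'");
        }
    }

    public static void main(String[] args) {
        validateLeafQueuePath("root.a"); // fine
        try {
            validateLeafQueuePath("root."); // empty short name -> rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```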
[jira] [Updated] (YARN-11578) Fix performance issue of permission check in verifyAndCreateRemoteLogDir
[ https://issues.apache.org/jira/browse/YARN-11578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11578: - Fix Version/s: 3.4.0 > Fix performance issue of permission check in verifyAndCreateRemoteLogDir > > > Key: YARN-11578 > URL: https://issues.apache.org/jira/browse/YARN-11578 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-10901 introduced a check to avoid a warn message in NN logs in certain > situations (when /tmp/logs is not owned by the yarn user), but it adds 3 > NameNode calls (create, setpermission, delete) during log aggregation > collection, for *every* NM. Meaning, when a YARN job completes, at the YARN > log aggregation phase this check is done for every job, from every > NodeManager. > In 30 minutes 4.2 % of all the NameNode calls were due to this in a cluster. > "write" calls need a Namesystem writeLock as well, so the impact is bigger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11578) Fix performance issue of permission check in verifyAndCreateRemoteLogDir
[ https://issues.apache.org/jira/browse/YARN-11578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11578. -- Hadoop Flags: Reviewed Resolution: Fixed > Fix performance issue of permission check in verifyAndCreateRemoteLogDir > > > Key: YARN-11578 > URL: https://issues.apache.org/jira/browse/YARN-11578 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-10901 introduced a check to avoid a warn message in NN logs in certain > situations (when /tmp/logs is not owned by the yarn user), but it adds 3 > NameNode calls (create, setpermission, delete) during log aggregation > collection, for *every* NM. Meaning, when a YARN job completes, at the YARN > log aggregation phase this check is done for every job, from every > NodeManager. > In 30 minutes 4.2 % of all the NameNode calls were due to this in a cluster. > "write" calls need a Namesystem writeLock as well, so the impact is bigger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
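The cost pattern in YARN-11578 — three write-path NameNode RPCs (create, setPermission, delete) per NodeManager just to probe the remote log dir — suggests replacing the probe with a single metadata read and only taking the expensive path when it is actually needed. The sketch below is a hedged illustration of that idea with a plain in-memory owner map (it is not the actual Hadoop patch, and the method names are hypothetical):

```java
import java.util.Map;

// Hypothetical sketch: one cheap metadata lookup decides whether the costly
// create/setPermission/delete verification is needed at all.
public class RemoteLogDirCheck {
    // Stand-in for a single getFileStatus-style read (1 RPC instead of 3).
    static boolean needsPermissionFix(Map<String, String> dirOwner,
                                      String dir, String expectedOwner) {
        String owner = dirOwner.get(dir);
        // Only when the owner differs (or the dir is missing) do we need
        // the expensive verification/repair path.
        return owner == null || !owner.equals(expectedOwner);
    }

    public static void main(String[] args) {
        Map<String, String> owners = Map.of("/tmp/logs", "yarn");
        System.out.println(needsPermissionFix(owners, "/tmp/logs", "yarn")); // prints false
        System.out.println(needsPermissionFix(owners, "/tmp/logs", "hdfs")); // prints true
    }
}
```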
[jira] [Updated] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11567: - Fix Version/s: 3.4.0 > Aggregate container launch debug artifacts automatically in case of error > - > > Key: YARN-11567 > URL: https://issues.apache.org/jira/browse/YARN-11567 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > In cases where a container fails to launch without writing to a log file, we > would often want to see the artifacts captured by > {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better > understand the cause of the exit code. Enabling this feature for every > container may be overkill, so we need a feature flag to capture these > artifacts in case of errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759092#comment-17759092 ] Benjamin Teke edited comment on YARN-11514 at 8/25/23 3:53 PM: --- After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which has more risk than benefits especially because we're adding a simple map containing 2-4 elements. And since this is a public API we risk breaking other dependent components. So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} was (Author: bteke): After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which isn't necessarily worth it. And since this is a public API we risk breaking other dependent components. 
So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). > - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients needs to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This 
message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759092#comment-17759092 ] Benjamin Teke commented on YARN-11514: -- After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which isn't necessarily worth it. And since this is a public API we risk breaking other dependent components. So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). 
> - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients need to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
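Since the implementation chosen in YARN-11514 exposes the vector both as a flat bracketed string (configuredCapacityVector) and as an explicit entry list, client-side consumption of the string form stays trivial. An illustrative parser sketch (plain Java, not part of the YARN codebase; it assumes the "[name=value,...]" shape shown in the comment above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parse "[memory-mb=12.5%,vcores=12.5%]" back into an
// ordered map, showing that the flat string format is easy for clients to
// consume even without the Jettison-to-Jackson rewrite.
public class CapacityVectorParser {
    static Map<String, String> parse(String vector) {
        Map<String, String> entries = new LinkedHashMap<>();
        // Strip the surrounding brackets, then split "name=value" pairs.
        String body = vector.substring(1, vector.length() - 1);
        if (body.isEmpty()) {
            return entries;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.split("=", 2);
            entries.put(kv[0], kv[1]);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("[memory-mb=12.5%,vcores=12.5%]"));
        // prints {memory-mb=12.5%, vcores=12.5%}
    }
}
```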
[jira] [Resolved] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11535. -- Assignee: Benjamin Teke (was: Susheel Gupta) Resolution: Fixed > Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. To overcome this and since > jackson-dataformat-yaml is not actually used it should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Summary: Remove jackson-dataformat-yaml dependency (was: Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7) > Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Description: Hadoop-project uses [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] and [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. But internally jackson-dataformat-yaml-2.12.7 uses compile dependency [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] .This may cause a transitive dependency issue in other services using hadoop jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 will use nearest dependency available of snakeyaml i.e 1.27 and ignore the version of snakeyaml-2.0 from hadoop-project. To overcome this and since jackson-dataformat-yaml is not actually used it should be removed. was: Hadoop-project uses [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] and [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. But internally jackson-dataformat-yaml-2.12.7 uses compile dependency [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] .This may cause a transitive dependency issue in other services using hadoop jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 will use nearest dependency available of snakeyaml i.e 1.27 and ignore the version of snakeyaml-2.0 from hadoop-project. 
> Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. To overcome this and since > jackson-dataformat-yaml is not actually used it should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Attachment: deps.txt > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756905#comment-17756905 ] Benjamin Teke commented on YARN-11535: -- Since AFAIK the only reason jackson-dataformat-yaml was pulled in is to solve some dependency conflicts, a safer way to resolve this issue is to remove jackson-dataformat-yaml and exclude it in the one place it's transitively imported. Uploaded the new dependency:tree output after the change. I've created a PR for this: https://github.com/apache/hadoop/pull/5970/ Side-note: the only usage of snakeyaml is in Apache Slider (a long-retired project), to provide a possibility for writing its config to a YAML file, but without tests it's a larger effort to refactor it. Since its use case is a simple one, it's unlikely to break with upgrades. > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses the compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7]. > This may cause a transitive dependency issue in other services using the hadoop > jar that have jackson-dataformat-yaml-2.12.7, as jackson-dataformat-yaml-2.12.7 > will use the nearest available dependency of snakeyaml, i.e. 1.27, and ignore the > version of snakeyaml-2.0 from hadoop-project. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
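The exclusion approach described in the comment above would look roughly like the pom.xml fragment below. This is a sketch only: the module that transitively imports jackson-dataformat-yaml is not named in the thread, so `some-group:some-artifact` is a placeholder; see the linked PR for the actual change.

```xml
<!-- Hedged sketch: "some-group:some-artifact" stands in for whichever
     dependency transitively pulls jackson-dataformat-yaml. Excluding it
     here keeps snakeyaml 2.0 from hadoop-project as the resolved version. -->
<dependency>
  <groupId>some-group</groupId>
  <artifactId>some-artifact</artifactId>
  <exclusions>
    <exclusion>
      <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-yaml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After such a change, `mvn dependency:tree -Dincludes=org.yaml:snakeyaml` can confirm that only the intended snakeyaml version remains on the classpath.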
[jira] [Created] (YARN-11551) RM format-conf-store should delete all the content of ZKConfigStore
Benjamin Teke created YARN-11551: Summary: RM format-conf-store should delete all the content of ZKConfigStore Key: YARN-11551 URL: https://issues.apache.org/jira/browse/YARN-11551 Project: Hadoop YARN Issue Type: Bug Reporter: Benjamin Teke Assignee: Benjamin Teke To easily overcome the issue mentioned in YARN-11075, format-conf-store should delete everything under RM_SCHEDCONF_STORE_ZK_PARENT_PATH, not just the CONF_STORE, because LOGS can contain elements that cannot be deserialized because of a missing serialVersionUID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11416) FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder
[ https://issues.apache.org/jira/browse/YARN-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11416. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder > --- > > Key: YARN-11416 > URL: https://issues.apache.org/jira/browse/YARN-11416 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSQueueConverter > and its builder stores the variable capacitySchedulerConfig as a simple > Configuration object instead of CapacitySchedulerConfiguration. This is > misleading, as capacitySchedulerConfig suggests that it is indeed a > CapacitySchedulerConfiguration and it loses access to the convenience methods > to check for various properties. Because of this every time a property getter > is changed FS2CS should be checked if it reimplemented the same, otherwise > there might be behaviour differences or even bugs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11535. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11545) FS2CS not converts ACLs when all users are allowed
[ https://issues.apache.org/jira/browse/YARN-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11545. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > FS2CS not converts ACLs when all users are allowed > -- > > Key: YARN-11545 > URL: https://issues.apache.org/jira/browse/YARN-11545 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Currently we only convert ACLs if users or groups are set. This should be > extended to check if the "allAllowed" flag is set in the AccessControlList to > be able to preserve * values also for the ACLs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
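The extended check described in YARN-11545 can be sketched as follows. `AclSketch` is a minimal stand-in written for illustration; it is not Hadoop's AccessControlList or the actual FS2CS converter code, and the output format is assumed, not taken from the source.

```java
import java.util.List;

// Hedged sketch of the fixed conversion rule: check the all-allowed flag
// first, so a "*" (everyone) ACL is preserved instead of being dropped
// when no explicit users or groups are listed.
class AclSketch {
    final List<String> users;
    final List<String> groups;
    final boolean allAllowed;

    AclSketch(List<String> users, List<String> groups, boolean allAllowed) {
        this.users = users;
        this.groups = groups;
        this.allAllowed = allAllowed;
    }

    String toCapacitySchedulerAcl() {
        if (allAllowed) {
            return "*"; // the case the original conversion missed
        }
        // "users groups" layout is illustrative only
        return String.join(",", users) + " " + String.join(",", groups);
    }

    public static void main(String[] args) {
        System.out.println(new AclSketch(List.of(), List.of(), true).toCapacitySchedulerAcl());
        System.out.println(new AclSketch(List.of("alice"), List.of("admins"), false).toCapacitySchedulerAcl());
    }
}
```

Without the `allAllowed` branch, an ACL with empty user and group lists would convert to an empty string rather than the wildcard, which is the behaviour the ticket fixes.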
[jira] [Updated] (YARN-11543) Fix checkstyle issues after YARN-11520
[ https://issues.apache.org/jira/browse/YARN-11543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11543: - Summary: Fix checkstyle issues after YARN-11520 (was: Fix checkstyle after YARN-11520) > Fix checkstyle issues after YARN-11520 > -- > > Key: YARN-11543 > URL: https://issues.apache.org/jira/browse/YARN-11543 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > YARN-11520 missed some checkstyle fixes, they should be resolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11543) Fix checkstyle after YARN-11520
Benjamin Teke created YARN-11543: Summary: Fix checkstyle after YARN-11520 Key: YARN-11543 URL: https://issues.apache.org/jira/browse/YARN-11543 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Assignee: Benjamin Teke YARN-11520 missed some checkstyle fixes, they should be resolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11521) Create a test set that runs with both Legacy/Uniform queue calculation
[ https://issues.apache.org/jira/browse/YARN-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11521: - Fix Version/s: 3.4.0 > Create a test set that runs with both Legacy/Uniform queue calculation > -- > > Key: YARN-11521 > URL: https://issues.apache.org/jira/browse/YARN-11521 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Follow-up ticket for YARN-11000. > The JSON assert tests in TestRMWebServicesCapacitySchedDynamicConfig are a > good candidate for this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11534) Incorrect exception handling in RecoveredContainerLaunch
[ https://issues.apache.org/jira/browse/YARN-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11534. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Incorrect exception handling in RecoveredContainerLaunch > > > Key: YARN-11534 > URL: https://issues.apache.org/jira/browse/YARN-11534 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When NM is restarted during a container recovery, it can happen that it > interrupts the container reaquisition during the LinuxContainerExecutor's > signalContainer method. In this case we will get the following exception: > {code:java} > java.io.InterruptedIOException: java.lang.InterruptedException > at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011) > at org.apache.hadoop.util.Shell.run(Shell.java:901) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291) > at > 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > Caused by: java.lang.InterruptedException > at java.base/java.lang.Object.wait(Native Method) > at java.base/java.lang.Object.wait(Object.java:328) > at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001) > ... 15 more{code} > Later this InterruptedIOException gets caught and wrapped inside a > PrivilegedOperationException and a ContainerExecutionException. 
In > LinuxContainerExecutor's > [signalContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L790] > method we catch this exception again, and throw an IOException from it, > indicating this error message in the stack trace: > {code:java} > IOException from it, causing the following stack trace: > org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: > Signal container failed > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84) > at >
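Independent of the Hadoop specifics, the usual remedy for the pattern above, an InterruptedIOException being swallowed by generic IOException handling, is to detect the interrupt-flavoured exception and restore the thread's interrupt status before rethrowing. A minimal self-contained sketch; the method names echo the stack trace but the bodies are illustrative, not Hadoop's code:

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class InterruptDemo {
    // Simulates a shell call interrupted mid-wait, as Shell.runCommand is in
    // the stack trace above (this stand-in just throws directly).
    static void runCommand() throws InterruptedIOException {
        InterruptedIOException iioe =
            new InterruptedIOException("interrupted during waitFor");
        iioe.initCause(new InterruptedException());
        throw iioe;
    }

    // One conventional fix: re-set the thread's interrupt flag so callers
    // further up can distinguish an interrupt from an ordinary I/O failure.
    static void signalContainer() throws IOException {
        try {
            runCommand();
        } catch (InterruptedIOException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            throw e;                            // keep the original cause visible
        }
    }

    public static void main(String[] args) {
        try {
            signalContainer();
        } catch (IOException e) {
            System.out.println(Thread.currentThread().isInterrupted()
                ? "interrupt preserved" : "interrupt lost");
        }
    }
}
```

The key design point is that wrapping the exception twice (PrivilegedOperationException, then ContainerExecutionException) without touching the interrupt flag makes the restart-time interrupt indistinguishable from a genuine signalling failure.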
[jira] [Created] (YARN-11539) Flexible AQC: setting capacity with leaf-template doesn't work
Benjamin Teke created YARN-11539: Summary: Flexible AQC: setting capacity with leaf-template doesn't work Key: YARN-11539 URL: https://issues.apache.org/jira/browse/YARN-11539 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke In [AbstractCSQueue.setDynamicQueueProperties|https://github.com/apache/hadoop/blob/193ff1c24e55938f42cb8ca12da754a2636f62a7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L451] we're always calling [AutoCreatedQueueTemplate.setTemplateEntriesForChild|https://github.com/apache/hadoop/blob/bf570bd4acd9ebccada80a68aa1c5fbf73ca60bf/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java#L107] with isLeaf false, hence leaf templates like capacity won't be added to the dynamic queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
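The defect described in YARN-11539 reduces to a small sketch: an overload that hard-codes `isLeaf` to false never applies leaf-only template keys such as capacity. The class and method names below mirror the linked Hadoop sources, but the bodies are hypothetical, written only to illustrate the bug pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: not Hadoop's AutoCreatedQueueTemplate, just the shape of
// the bug. Leaf-template entries are silently dropped because the one-arg
// overload always delegates with isLeaf=false.
class QueueTemplateSketch {
    final Map<String, String> parentTemplate = new HashMap<>();
    final Map<String, String> leafTemplate = new HashMap<>();

    void setTemplateEntriesForChild(Map<String, String> target) {
        setTemplateEntriesForChild(target, false); // bug: isLeaf hard-coded
    }

    void setTemplateEntriesForChild(Map<String, String> target, boolean isLeaf) {
        target.putAll(parentTemplate);
        if (isLeaf) {
            target.putAll(leafTemplate); // leaf-only keys such as "capacity"
        }
    }

    public static void main(String[] args) {
        QueueTemplateSketch t = new QueueTemplateSketch();
        t.leafTemplate.put("capacity", "50%");
        Map<String, String> dynamicLeafQueue = new HashMap<>();
        t.setTemplateEntriesForChild(dynamicLeafQueue);
        // The leaf queue never receives its capacity template entry.
        System.out.println(dynamicLeafQueue.containsKey("capacity"));
    }
}
```

The fix implied by the ticket is for the caller in AbstractCSQueue to pass the queue's actual leaf status instead of a constant.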
[jira] [Assigned] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11514: Assignee: Benjamin Teke > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). > - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients needs to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11520) Support capacity vector for AQCv2 dynamic templates
[ https://issues.apache.org/jira/browse/YARN-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11520: Assignee: Benjamin Teke > Support capacity vector for AQCv2 dynamic templates > --- > > Key: YARN-11520 > URL: https://issues.apache.org/jira/browse/YARN-11520 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > > AQCv2 dynamic queue templates should support the new capacity modes. > e.g.: > {code} > auto-queue-creation-v2.parent-template.capacity = [memory=12288, vcores=86%] > auto-queue-creation-v2.leaf-template.capacity = [memory=1w, vcores=1] > auto-queue-creation-v2.template.capacity = [memory=20%, vcores=50%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
[ https://issues.apache.org/jira/browse/YARN-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11533: - Fix Version/s: 3.4.0 > CapacityScheduler CapacityConfigType changed in legacy queue allocation mode > > > Key: YARN-11533 > URL: https://issues.apache.org/jira/browse/YARN-11533 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-11000 changed the CapacityConfigType determination method in legacy > queue mode, which has some undesired effects, like marking root to be in > percentage mode, when the rest of the queue structure is in absolute. > Config: > {code:java} > > > yarn.scheduler.capacity.root.maximum-capacity > [memory=40960,vcores=32] > > > yarn.scheduler.capacity.root.queues > default > > > yarn.scheduler.capacity.root.capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.schedule-asynchronously.enable > true > > > > yarn.scheduler.capacity.root.default.maximum-capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.root.default.maximum-am-resource-percent > 0.2 > > > yarn.scheduler.capacity.root.default.capacity > [memory=40960,vcores=32] > > > {code} > Old response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": true, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "absolute", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > New 
response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": false, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "percentage", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > This is misleading and has some side-effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9877) Intermittent TIME_OUT of LogAggregationReport
[ https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-9877: Fix Version/s: 3.4.0 > Intermittent TIME_OUT of LogAggregationReport > - > > Key: YARN-9877 > URL: https://issues.apache.org/jira/browse/YARN-9877 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager, yarn >Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-9877.001.patch > > > I noticed some intermittent TIME_OUT in some downstream log-aggregation based > tests. > Steps to reproduce: > - Let's run a MR job > {code} > hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep > -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000 > {code} > - Suppose the AM is requesting more containers, but as soon as they're > allocated - the AM realizes it doesn't need them. The container's state > changes are: ALLOCATED -> ACQUIRED -> RELEASED. > Let's suppose these extra containers are allocated in a different node from > the other 21 (AM + 10 mapper + 10 reducer) containers' node. > - All the containers finish successfully and the app is finished successfully > as well. Log aggregation status for the whole app seemingly stucks in RUNNING > state. > - After a while the final log aggregation status for the app changes to > TIME_OUT. > Root cause: > - As unused containers are getting through the state transition in the RM's > internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s > transition function is called. This calls the > {{RMAppLogAggregation$addReportIfNecessary}} which forcefully adds the > "NOT_START" LogAggregationStatus associated with this NodeId for the app, > even though it does not have any running container on it. 
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the > NodeManager because it does not have any running container on it (Note that > the AM immediately released them after acquisition). The LogAggregationStatus > remains NOT_START until time out is reached. After that point the RM > aggregates the LogAggregationReports for all the nodes, and though all the > containers have SUCCEEDED state, one particular node has NOT_START, so the > final log aggregation will be TIME_OUT. > (I crawled the RM UI for the log aggregation statuses, and it was always > NOT_START for this particular node). > This situation is highly unlikely, but has an estimated ~0.8% of failure rate > based on a year's 1500 run on an unstressed cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
[ https://issues.apache.org/jira/browse/YARN-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11533: - Parent: YARN-10888 Issue Type: Sub-task (was: Bug) > CapacityScheduler CapacityConfigType changed in legacy queue allocation mode > > > Key: YARN-11533 > URL: https://issues.apache.org/jira/browse/YARN-11533 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > YARN-11000 changed the CapacityConfigType determination method in legacy > queue mode, which has some undesired effects, like marking root to be in > percentage mode, when the rest of the queue structure is in absolute. > Config: > {code:java} > > > yarn.scheduler.capacity.root.maximum-capacity > [memory=40960,vcores=32] > > > yarn.scheduler.capacity.root.queues > default > > > yarn.scheduler.capacity.root.capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.schedule-asynchronously.enable > true > > > > yarn.scheduler.capacity.root.default.maximum-capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.root.default.maximum-am-resource-percent > 0.2 > > > yarn.scheduler.capacity.root.default.capacity > [memory=40960,vcores=32] > > > {code} > Old response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": true, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "absolute", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > New response: > 
{code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": false, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "percentage", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > This is misleading and has some side-effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
Benjamin Teke created YARN-11533: Summary: CapacityScheduler CapacityConfigType changed in legacy queue allocation mode Key: YARN-11533 URL: https://issues.apache.org/jira/browse/YARN-11533 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke
YARN-11000 changed the CapacityConfigType determination method in legacy queue mode, which has some undesired effects, such as marking root as being in percentage mode while the rest of the queue structure is in absolute mode.
Config:
{code:xml}
<property>
  <name>yarn.scheduler.capacity.root.maximum-capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-am-resource-percent</name>
  <value>0.2</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
{code}
Old response:
{code:java}
{
  "scheduler": {
    "schedulerInfo": {
      "type": "capacityScheduler",
      "capacity": 100.0,
      "usedCapacity": 0.0,
      "maxCapacity": 100.0,
      "weight": -1.0,
      "normalizedWeight": 0.0,
      "queueName": "root",
      "queuePath": "root",
      "maxParallelApps": 2147483647,
      "isAbsoluteResource": true,
      "queues": {...},
      "queuePriority": 0,
      "orderingPolicyInfo": "utilization",
      "mode": "absolute",
      "queueType": "parent",
      "creationMethod": "static",
      "autoCreationEligibility": "off",
      "autoQueueTemplateProperties": {},
      "autoQueueParentTemplateProperties": {},
      "autoQueueLeafTemplateProperties": {}
    }
  }
}
{code}
New response:
{code:java}
{
  "scheduler": {
    "schedulerInfo": {
      "type": "capacityScheduler",
      "capacity": 100.0,
      "usedCapacity": 0.0,
      "maxCapacity": 100.0,
      "weight": -1.0,
      "normalizedWeight": 0.0,
      "queueName": "root",
      "queuePath": "root",
      "maxParallelApps": 2147483647,
      "isAbsoluteResource": false,
      "queues": {...},
      "queuePriority": 0,
      "orderingPolicyInfo": "utilization",
      "mode": "percentage",
      "queueType": "parent",
      "creationMethod": "static",
      "autoCreationEligibility": "off",
      "autoQueueTemplateProperties": {},
      "autoQueueParentTemplateProperties": {},
      "autoQueueLeafTemplateProperties": {}
    }
  }
}
{code}
This is misleading and has some side effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
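The mode flip described above hinges on the syntax of the capacity value: absolute resources are written as a bracketed vector ([memory=40960,vcores=32]) while percentage mode uses a plain number. A minimal, hypothetical sketch of that distinction (class and method names are illustrative; the real determination lives in CapacityScheduler's configuration handling and is more involved):

```java
// Hedged sketch: distinguishing absolute-resource capacity values from
// percentage values by their syntax. Not the actual CapacityScheduler code.
public class CapacityModeSketch {
    static boolean isAbsoluteResource(String value) {
        String v = value == null ? "" : value.trim();
        // Absolute mode: bracketed resource vector, e.g. [memory=40960,vcores=32]
        return v.startsWith("[") && v.endsWith("]");
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteResource("[memory=40960,vcores=32]")); // true
        System.out.println(isAbsoluteResource("100"));                      // false
    }
}
```

Under this reading, the bug report amounts to root's value no longer being classified by this syntax check in legacy queue mode.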
[jira] [Resolved] (YARN-11464) TestFSQueueConverter#testAutoCreateV2FlagsInWeightMode has a missing dot before auto-queue-creation-v2.enabled for method call assertNoValueForQueues
[ https://issues.apache.org/jira/browse/YARN-11464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11464. -- Fix Version/s: 3.4.0 Resolution: Fixed
> TestFSQueueConverter#testAutoCreateV2FlagsInWeightMode has a missing dot
> before auto-queue-creation-v2.enabled for method call assertNoValueForQueues
>
> Key: YARN-11464
> URL: https://issues.apache.org/jira/browse/YARN-11464
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.3.4
> Reporter: Susheel Gupta
> Assignee: Susheel Gupta
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> This testcase clearly reproduces the issue. There is a missing dot before
> "auto-queue-creation-v2.enabled" for method call assertNoValueForQueues.
> {code:java}
> @Test
> public void testAutoCreateV2FlagsInWeightMode() {
>   converter = builder.withPercentages(false).build();
>   converter.convertQueueHierarchy(rootQueue);
>   assertTrue("root autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.admins autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.admins.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.users autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.users.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.misc autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.misc.auto-queue-creation-v2.enabled", false));
>   Set<String> leafs = Sets.difference(ALL_QUEUES,
>       Sets.newHashSet("root",
>           "root.default",
>           "root.admins",
>           "root.users",
>           "root.misc"));
>   assertNoValueForQueues(leafs, "auto-queue-creation-v2.enabled", csConfig);
> } {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
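The effect of the missing dot is easiest to see by building the lookup key the way the helper presumably does. In this sketch, PREFIX and the concatenation inside assertNoValueForQueues are assumptions inferred from the test shown above, not the actual converter test code:

```java
// Hedged sketch: how a missing leading dot in the property argument mangles
// the config key that assertNoValueForQueues presumably builds.
public class MissingDotSketch {
    // Assumed prefix, matching the capacity-scheduler keys in the quoted test
    static final String PREFIX = "yarn.scheduler.capacity.";

    // Assumed shape of the key built from a queue path and a property suffix
    static String key(String queue, String property) {
        return PREFIX + queue + property;
    }

    public static void main(String[] args) {
        // Without the leading dot the key never matches any real config entry,
        // so a "queue has no value" assertion passes vacuously:
        System.out.println(key("root.users", "auto-queue-creation-v2.enabled"));
        // With the dot it addresses the intended property:
        System.out.println(key("root.users", ".auto-queue-creation-v2.enabled"));
    }
}
```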
[jira] [Resolved] (YARN-11513) Applications submitted to ambiguous queue fail during recovery if "Specified" Placement Rule is used
[ https://issues.apache.org/jira/browse/YARN-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11513. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Applications submitted to ambiguous queue fail during recovery if "Specified" > Placement Rule is used > > > Key: YARN-11513 > URL: https://issues.apache.org/jira/browse/YARN-11513 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.4 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When an app is submitted to an ambiguous queue using the full queue path and > is placed in that pool via a {{%specified}} mapping Placement Rule, the queue > in the stored ApplicationSubmissionContext will be the short name for the > queue. During recovery from an RM failover, the placement rule will be > evaluated using the stored short name of the queue, resulting in the RM > killing the app as it cannot resolve the ambiguous queue name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
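Why the persisted short name breaks recovery can be shown in a few lines: once two leaf queues share the same last path segment, the short form no longer identifies a unique queue. This is an illustrative toy, not the RM's actual queue resolution code:

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of ambiguous short-name resolution. Queue paths and the
// matching rule are illustrative only.
public class AmbiguousQueueSketch {
    static long countMatches(List<String> queuePaths, String shortName) {
        return queuePaths.stream()
            .filter(p -> p.endsWith("." + shortName))
            .count();
    }

    public static void main(String[] args) {
        // Two leaf queues sharing the short name "dev"
        List<String> queuePaths = Arrays.asList("root.a.dev", "root.b.dev");
        String stored = "dev"; // short name persisted in the ApplicationSubmissionContext

        // Two candidates -> resolution is ambiguous, so recovery cannot place the app
        System.out.println(countMatches(queuePaths, stored)); // 2
    }
}
```

Storing the full path ("root.a.dev") instead of the short name sidesteps the ambiguity, which is essentially what the fix needs to guarantee.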
[jira] [Resolved] (YARN-11498) Exclude Jettison from jersey-json artifact in hadoop-yarn-common's pom.xml
[ https://issues.apache.org/jira/browse/YARN-11498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11498. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Exclude Jettison from jersey-json artifact in hadoop-yarn-common's pom.xml > -- > > Key: YARN-11498 > URL: https://issues.apache.org/jira/browse/YARN-11498 > Project: Hadoop YARN > Issue Type: Task > Components: build >Reporter: Devaspati Krishnatri >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > This exclusion is done to reduce CVEs present due to an older version of > Jettison(1.1) being pulled in with jersey-json artifact. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
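For reference, an exclusion of this kind takes the usual Maven form. The following is a sketch of what the hadoop-yarn-common pom.xml change presumably looks like; the exact coordinates and surrounding context in the committed patch may differ:

```xml
<dependency>
  <groupId>com.sun.jersey</groupId>
  <artifactId>jersey-json</artifactId>
  <exclusions>
    <!-- Jettison 1.1 carries known CVEs; exclude the transitive copy -->
    <exclusion>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```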
[jira] [Resolved] (YARN-11511) Improve TestRMWebServices test config and data
[ https://issues.apache.org/jira/browse/YARN-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11511. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Improve TestRMWebServices test config and data > -- > > Key: YARN-11511 > URL: https://issues.apache.org/jira/browse/YARN-11511 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > We found multiple configuration issues with > *TestRMWebServicesCapacitySched.java* and > *TestRMWebServicesCapacitySchedDynamicConfig.java*. > h3. 1. createMockRM created the RM with a non-intuitive queue config > (createMockRM was used from the TestRMWebServicesCapacitySchedDynamicConfig > where this was not expected) > Fix: > {code} > diff --git > a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > > b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > index ec65237fa6e..378f16e981a 100644 > --- > a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > +++ > b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > @@ -108,13 +108,13 @@ public TestRMWebServicesCapacitySched() { >@Override >public void setUp() throws Exception { > super.setUp(); > -rm = createMockRM(new CapacitySchedulerConfiguration( > -new Configuration(false))); > +rm = 
createMockRM(setupQueueConfiguration(new > CapacitySchedulerConfiguration( > +new Configuration(false)))); > GuiceServletConfig.setInjector( > Guice.createInjector(new WebServletModule(rm))); >} > - public static void setupQueueConfiguration( > + public CapacitySchedulerConfiguration setupQueueConfiguration( >CapacitySchedulerConfiguration config) { > // Define top-level queues > @@ -167,6 +167,8 @@ public static void setupQueueConfiguration( > config.setAutoCreateChildQueueEnabled(a1C, true); > config.setInt(PREFIX + a1C + DOT + > AUTO_CREATED_LEAF_QUEUE_TEMPLATE_PREFIX > + DOT + CAPACITY, 50); > + > +return config; >} >@Test > @@ -407,7 +409,6 @@ public static WebAppDescriptor createWebAppDescriptor() { >} >public static MockRM createMockRM(CapacitySchedulerConfiguration csConf) { > -setupQueueConfiguration(csConf); > YarnConfiguration conf = new YarnConfiguration(csConf); > conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class, > ResourceScheduler.class); > {code} > h3. 2. setupQueueConfiguration creates a mixed queue hierarchy (percentage > and absolute) > {code} > final String c = CapacitySchedulerConfiguration.ROOT + ".c"; > config.setCapacity(c, "[memory=1024]"); > {code} > root.c is configured in absolute mode while root.a and root.b are configured > in percentage > setupQueueConfiguration should be simplified, do the configuration like the > other tests (create a map with the string key value pairs) > h3. 3. createAbsoluteConfigLegacyAutoCreation does not set capacity for the > default queue > That makes it mixed (percentage + absolute) > h3. 4. initAutoQueueHandler is called with wrong units > The * GB is unnecessary, and the vcores should be configured too with a value > that makes sense. > h3. 5. CSConfigGenerator static class makes the tests hard to read. > The test cases should just have their configuration assembled in them. > h3. 6. 
testSchedulerResponseAbsoluteMode does not add any node > No cluster resource -> no effectiveMin/Max resource. > h1. Proposal > These tests need a rework: the configurations should be easy to follow, and the calculated effectiveMin/Max (and any other calculated value) should result in reasonable numbers. The goal is to have an end-to-end-like test suite that verifies the queue hierarchy initialisation. > The queue hierarchies should be simple but at least 2 levels deep, e.g.: > root.default > root.test_1.test_1_1 > root.test_1.test_1_2 > root.test_2 > The helper methods could be moved to a separate utility class from > TestRMWebServicesCapacitySched. > TestRMWebServicesCapacitySched can be kept for the basic tests
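The map-based configuration the proposal suggests could look like the following sketch. The keys follow the standard capacity-scheduler property naming and the 2-level hierarchy proposed above; the specific capacity values are illustrative assumptions, not the committed test code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the proposed "map with string key/value pairs" style of
// test configuration, using the 2-level hierarchy from the proposal.
public class QueueConfigSketch {
    static Map<String, String> sampleHierarchy() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("yarn.scheduler.capacity.root.queues", "default,test_1,test_2");
        conf.put("yarn.scheduler.capacity.root.test_1.queues", "test_1_1,test_1_2");
        conf.put("yarn.scheduler.capacity.root.default.capacity", "25");
        conf.put("yarn.scheduler.capacity.root.test_1.capacity", "50");
        conf.put("yarn.scheduler.capacity.root.test_2.capacity", "25");
        conf.put("yarn.scheduler.capacity.root.test_1.test_1_1.capacity", "60");
        conf.put("yarn.scheduler.capacity.root.test_1.test_1_2.capacity", "40");
        return conf;
    }

    public static void main(String[] args) {
        // The whole queue structure is visible at a glance, which is the point
        // of the proposal.
        sampleHierarchy().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```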
[jira] [Created] (YARN-11503) Adding queues separately in short succession with Mutation API will stop CS allocating new containers
Benjamin Teke created YARN-11503: Summary: Adding queues separately in short succession with Mutation API will stop CS allocating new containers Key: YARN-11503 URL: https://issues.apache.org/jira/browse/YARN-11503 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.4.0 Reporter: Benjamin Teke Adding multiple queues in short succession via Mutation API will result in some race condition when adding the partition metrics for those queues, as noted by the unhandled exception: {code:java} 2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m 2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0 2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN 2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m 2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception. org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists! 
2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0, usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0, effectiveMinResource= , effectiveMaxResource= {code} Initializing the leaf queue root.eca_m should only happen once during a reinit (twice if the validation endpoint is used), but in this case it happened three times within a quarter of a second. This results in an unhandled exception in the async scheduling thread, which will then block new container allocation (existing containers can still transition to other states, however). {code:java} 2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception. org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists! 
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) {code} Even though
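The MetricsException in the trace above is the signature of two threads racing through a check-then-register sequence on the same source name. A hedged sketch of an idempotent alternative follows; this is not the DefaultMetricsSystem API, just an illustration of making concurrent registration a no-op instead of a fatal error:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch: idempotent metrics-source registration. Names and types are
// illustrative; the real registry is Hadoop's DefaultMetricsSystem.
public class MetricsRegistrySketch {
    private final ConcurrentMap<String, Object> sources = new ConcurrentHashMap<>();

    // putIfAbsent is atomic: a second concurrent caller gets the first
    // caller's source back rather than an "already exists" exception.
    Object registerIfAbsent(String name, Object source) {
        Object prev = sources.putIfAbsent(name, source);
        return prev != null ? prev : source;
    }

    public static void main(String[] args) {
        MetricsRegistrySketch registry = new MetricsRegistrySketch();
        String name = "PartitionQueueMetrics,partition=,q0=root,q1=eca_m";
        Object first = registry.registerIfAbsent(name, new Object());
        Object second = registry.registerIfAbsent(name, new Object());
        System.out.println(first == second); // true: re-registration is a no-op
    }
}
```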
[jira] [Updated] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11312: - Fix Version/s: 3.2.5 3.3.6 > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.5, 3.3.6 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11312. -- Resolution: Fixed > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.5, 3.3.6 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11312: - Target Version/s: 3.4.0, 3.2.5, 3.3.6 (was: 3.4.0) > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722190#comment-17722190 ] Benjamin Teke commented on YARN-11312: -- Reopening for 3.2 and 3.3 backports. > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11312: -- > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11312. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722154#comment-17722154 ] Benjamin Teke commented on YARN-11312: -- Background info on the change: https://github.com/emberjs/ember.js/issues/14168 > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
[ https://issues.apache.org/jira/browse/YARN-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11463: - Fix Version/s: 3.4.0 > Node Labels root directory creation doesn't have a retry logic > -- > > Key: YARN-11463 > URL: https://issues.apache.org/jira/browse/YARN-11463 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When CS is initialized, it'll [try to create the configured node labels root > dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. > This however doesn't implement any kind of retry logic (in contrast to the > RM FS state store or ZK state store), hence if the distributed file system is > unavailable at the exact moment CS tries to start it'll fail. A retry logic > could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11450) Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase
[ https://issues.apache.org/jira/browse/YARN-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11450: - Fix Version/s: 3.4.0 > Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase > > > Key: YARN-11450 > URL: https://issues.apache.org/jira/browse/YARN-11450 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11459) Consider changing label called "max resource" on UIv1 and UIv2
[ https://issues.apache.org/jira/browse/YARN-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11459. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Consider changing label called "max resource" on UIv1 and UIv2 > -- > > Key: YARN-11459 > URL: https://issues.apache.org/jira/browse/YARN-11459 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Riya Khandelwal >Assignee: Riya Khandelwal >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Related discussion is in > [-ENGESC-16432-|https://jira.cloudera.com/browse/ENGESC-16432] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
[ https://issues.apache.org/jira/browse/YARN-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11463. -- Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Node Labels root directory creation doesn't have a retry logic > -- > > Key: YARN-11463 > URL: https://issues.apache.org/jira/browse/YARN-11463 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > > When CS is initialized, it'll [try to create the configured node labels root > dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. > This however doesn't implement any kind of retry logic (in contrast to the > RM FS state store or ZK state store), hence if the distributed file system is > unavailable at the exact moment CS tries to start it'll fail. A retry logic > could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
Benjamin Teke created YARN-11463: Summary: Node Labels root directory creation doesn't have a retry logic Key: YARN-11463 URL: https://issues.apache.org/jira/browse/YARN-11463 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Benjamin Teke When CS is initialized, it'll [try to create the configured node labels root dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. This however doesn't implement any kind of retry logic (in contrast to the RM FS state store or ZK state store), hence if the distributed file system is unavailable at the exact moment CS tries to start it'll fail. A retry logic could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
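A retry wrapper of the kind this report asks for could look like the sketch below. The attempt count and backoff are arbitrary placeholders, and a real patch would more likely reuse Hadoop's existing retry-policy machinery rather than hand-roll a loop:

```java
import java.io.IOException;

// Hedged sketch: bounded retries with linear backoff around a filesystem
// operation such as creating the node labels root directory.
public class RetrySketch {
    interface FsOp {
        void run() throws IOException;
    }

    static void withRetries(FsOp op, int maxAttempts, long baseBackoffMs)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();
                return;
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(baseBackoffMs * attempt); // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated mkdirs that fails twice before the file system becomes available
        withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem unavailable");
            }
        }, 5, 10);
        System.out.println(calls[0]); // 3
    }
}
```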