[jira] [Assigned] (YARN-11681) Update the cgroup documentation with v2 support
[ https://issues.apache.org/jira/browse/YARN-11681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11681: Assignee: Benjamin Teke > Update the cgroup documentation with v2 support > --- > > Key: YARN-11681 > URL: https://issues.apache.org/jira/browse/YARN-11681 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > Update the related > [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html] > with v2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11692) Support mixed cgroup v1/v2 controller structure
[ https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11692. -- Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Support mixed cgroup v1/v2 controller structure > --- > > Key: YARN-11692 > URL: https://issues.apache.org/jira/browse/YARN-11692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > There were heavy changes on the device side in cgroup v2. To keep supporting > FPGAs and GPUs in the short term, mixed structures where some of the cgroup > controllers are from v1 while others are from v2 should be supported. More info: > https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
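The mixed-controller idea above can be sketched as a small helper that decides which cgroup version serves each controller, based on /proc/mounts-style content. This is an illustrative Python sketch under stated assumptions (the function name and controller list are hypothetical, not YARN's actual implementation):

```python
def controller_versions(mounts_text):
    """Map each cgroup controller to the version ('v1' or 'v2') it is
    mounted under, given the text of /proc/mounts or /etc/mtab.

    cgroup v1 mounts one filesystem per controller (the controller names
    appear among the mount options); cgroup v2 mounts a single unified
    'cgroup2' filesystem for all controllers."""
    known = ("cpu", "cpuacct", "memory", "io", "blkio", "devices")
    versions = {}
    v2_mounted = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        fstype, options = fields[2], fields[3].split(",")
        if fstype == "cgroup":
            # v1: the mount options name the controllers this mount serves.
            for ctrl in known:
                if ctrl in options:
                    versions[ctrl] = "v1"
        elif fstype == "cgroup2":
            v2_mounted = True
    # In a mixed setup, controllers not claimed by a v1 mount are
    # reachable through the unified v2 hierarchy.
    if v2_mounted:
        for ctrl in ("cpu", "memory", "io", "devices"):
            versions.setdefault(ctrl, "v2")
    return versions
```

For the scenario in this issue, a mounts table with a v1 `devices` mount plus a unified `cgroup2` mount would map `devices` to v1 and the remaining controllers to v2.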
[jira] [Reopened] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11669: -- > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.5.0 > > > cgroup v2 has some fundamental changes compared to v1. RHEL9 and Ubuntu 22 > have already moved to cgroup v2 as the default, hence YARN should support it. This > umbrella tracks the required work. > [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] > A way to test the newly added features: > # Turn on cgroup v1 based on the current > [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. > # System prerequisites: > ## the file {{/etc/mtab}} should contain a mount path with the file system > type {{cgroup2}}; by default this is {{/sys/fs/cgroup}} on most OSes > ## the {{cgroup.subtree_control}} file should contain the necessary > controllers (update it with: {{echo "+cpu +io +memory" > > cgroup.subtree_control}}) > ## either create the YARN hierarchy and give recursive access to the user > running the NM on the node. The hierarchy is {{hadoop-yarn}} by default > (controlled by > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and > recursive mode is required, because as soon as the directory is created it > will be filled with the controller files which YARN will try to edit. > ### Alternatively, if the NM process user has access rights on the > {{/sys/fs/cgroup}} directory, it'll try to create the hierarchy and update the > {{cgroup.subtree_control}} file. 
> # YARN configuration > ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should > point to the directory where the cgroup2 structure is mounted and the > {{hadoop-yarn}} hierarchy was created > ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be > set to {{true}} > ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} > # Launch the NM and monitor the cgroup files on container launches (e.g. > {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}})
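The system prerequisites in step 2 above can be sketched as a small validation helper. This is an illustrative Python sketch, not part of YARN: the function name is hypothetical, and passing the file contents in as text (rather than reading /etc/mtab and cgroup.subtree_control directly) is an assumption made to keep the example self-contained:

```python
def check_cgroup_v2_prereqs(mtab_text, subtree_control_text,
                            required=("cpu", "io", "memory")):
    """Return a list of problems with the cgroup v2 prerequisites.

    mtab_text: contents of /etc/mtab; there must be a mount entry with
    filesystem type 'cgroup2' (typically mounted at /sys/fs/cgroup).
    subtree_control_text: contents of cgroup.subtree_control at the
    mount root; it must list every controller YARN needs."""
    problems = []
    mount_path = None
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "cgroup2":
            mount_path = fields[1]  # e.g. /sys/fs/cgroup
            break
    if mount_path is None:
        problems.append("no cgroup2 filesystem listed in /etc/mtab")
    enabled = set(subtree_control_text.split())
    missing = [c for c in required if c not in enabled]
    if missing:
        problems.append(
            "cgroup.subtree_control is missing controllers: "
            + " ".join(missing)
            + ' (enable with: echo "+cpu +io +memory" > cgroup.subtree_control)')
    return problems
```

An empty result means both prerequisites hold; each returned string describes one failed check, matching the two system prerequisites listed in the test steps.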
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Fix Version/s: (was: 3.5.0) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Resolved] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11669. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.5.0
[jira] [Updated] (YARN-11689) Update the cgroup v2 init error handling to provide more straightforward error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11689: - Description: The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. The cgroup v2 init should be more stable and it should be updated to show the exact step where it failed. (was: The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. It would be useful to show the underlying exception and its message as well, by default.) > Update the cgroup v2 init error handling to provide more straightforward > error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. The cgroup v2 init should be more stable and it should be > updated to show the exact step where it failed.
[jira] [Updated] (YARN-11689) Update the cgroup v2 init error handling to provide more straightforward error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11689: - Summary: Update the cgroup v2 init error handling to provide more straightforward error messages (was: Update getErrorWithDetails method to provide more meaningful error messages) > Update the cgroup v2 init error handling to provide more straightforward > error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. It would be useful to show the underlying exception and its > message as well, by default.
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and give recursive access to the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and recursive mode is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R user:group /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R user:group /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R yarn:hadoop /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] A way to test the newly added features: # Turn on cgroup v1 based on the current [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. # System prerequisites: ## the file {{/etc/mtab}} should contain a mount path with the file system type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OSes ## the {{cgroup.subtree_control}} file should contain the necessary controllers (update it with: {{echo "+cpu +io +memory" > cgroup.subtree_control}}) ## either create the YARN hierarchy and make it owned by the user running the NM on the node. The hierarchy is {{hadoop-yarn}} by default (controlled by {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and {{chown -R yarn:hadoop /sys/fs/cgroup/hadoop-yarn}} is needed. -R is required, because as soon as the directory is created it will be filled with the controller files which YARN will try to edit. ### Alternatively if the NM process user has access rights on the {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the {{cgroup.subtree_control}} file. # YARN configuration ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should point to the directory where the cgroup2 structure is mounted and the {{hadoop-yarn}} hierarchy was created ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be set to {{true}} ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} # Launch the NM and monitor the cgroup files on container launches (e.g. {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}}) was: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major
[jira] [Updated] (YARN-11692) Support mixed cgroup v1/v2 controller structure
[ https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11692: - Description: There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported. More info: https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/ (was: There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported. ) > Support mixed cgroup v1/v2 controller structure > --- > > Key: YARN-11692 > URL: https://issues.apache.org/jira/browse/YARN-11692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > There were heavy changes on the device side in cgroup v2. To keep supporting > FPGAs and GPUs in the short term, mixed structures where some of the cgroup > controllers are from v1 while others are from v2 should be supported. More info: > https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
[jira] [Created] (YARN-11692) Support mixed cgroup v1/v2 controller structure
Benjamin Teke created YARN-11692: Summary: Support mixed cgroup v1/v2 controller structure Key: YARN-11692 URL: https://issues.apache.org/jira/browse/YARN-11692 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs in the short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported.
[jira] [Created] (YARN-11690) Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios
Benjamin Teke created YARN-11690: Summary: Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios Key: YARN-11690 URL: https://issues.apache.org/jira/browse/YARN-11690 Project: Hadoop YARN Issue Type: Sub-task Components: container-executor Reporter: Benjamin Teke Assignee: Benjamin Teke The container executor function {{write_pid_to_cgroup_as_root}} writes the PID of the newly launched container to the correct cgroup.procs file. However, it checks whether the file is mounted on a cgroup filesystem, and does that check using the filesystem magic number, which differs between v1 and v2. The check should handle both v1 and v2 filesystems.
{code:java}
/**
 * Write the pid of the current process to the cgroup file.
 * cgroup_file: Path to cgroup file where pid needs to be written to.
 */
static int write_pid_to_cgroup_as_root(const char* cgroup_file, pid_t pid) {
  int rc = 0;
  uid_t user = geteuid();
  gid_t group = getegid();
  if (change_effective_user(0, 0) != 0) {
    rc = -1;
    goto cleanup;
  }

  // statfs
  struct statfs buf;
  if (statfs(cgroup_file, &buf) == -1) {
    fprintf(LOGFILE, "Can't statfs file %s as node manager - %s\n",
      cgroup_file, strerror(errno));
    rc = -1;
    goto cleanup;
  } else if (buf.f_type != CGROUP_SUPER_MAGIC) {
    fprintf(LOGFILE, "Pid file %s is not located on cgroup filesystem\n",
      cgroup_file);
    rc = -1;
    goto cleanup;
  }
{code}
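The fix implied here amounts to accepting either filesystem magic number in the check. A minimal sketch of the relaxed condition (Python for illustration; the function name is hypothetical, and the constants mirror the values defined in the kernel's include/uapi/linux/magic.h):

```python
# Filesystem magic numbers from the Linux kernel's include/uapi/linux/magic.h.
CGROUP_SUPER_MAGIC = 0x27E0EB    # cgroup v1 filesystem
CGROUP2_SUPER_MAGIC = 0x63677270  # cgroup v2 (unified) filesystem

def is_cgroup_filesystem(f_type):
    """Accept a statfs() f_type for either cgroup v1 or v2, mirroring the
    relaxed check write_pid_to_cgroup_as_root needs instead of comparing
    against CGROUP_SUPER_MAGIC alone."""
    return f_type in (CGROUP_SUPER_MAGIC, CGROUP2_SUPER_MAGIC)
```

In the C code above, the equivalent change would test `buf.f_type` against both constants rather than only `CGROUP_SUPER_MAGIC`.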
[jira] [Created] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages
Benjamin Teke created YARN-11689: Summary: Update getErrorWithDetails method to provide more meaningful error messages Key: YARN-11689 URL: https://issues.apache.org/jira/browse/YARN-11689 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. It would be useful to show the underlying exception and its message as well, by default.
[jira] [Assigned] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages
[ https://issues.apache.org/jira/browse/YARN-11689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11689: Assignee: Benjamin Teke > Update getErrorWithDetails method to provide more meaningful error messages > --- > > Key: YARN-11689 > URL: https://issues.apache.org/jira/browse/YARN-11689 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of > information. It would be useful to show the underlying exception and its > message as well, by default.
[jira] [Assigned] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11191: Assignee: Tamas Domok > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug that may impact all clusters with preemption enabled. In our > current version with preemption enabled, the CapacityScheduler calls the > refreshQueue method of the PreemptionManager when it refreshes queues. This > process holds the PreemptionManager write lock and requires the csqueue read > lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock > and requires the PreemptionManager read lock. > There is a possibility of deadlock here, because read locks follow one rule > under the non-fair policy: when the lock is already held by a read lock and the > first request in the lock's wait queue is a write-lock request, other > read-lock requests cannot acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PreemptionManager.readLock > other thread(completeContainer, release resource, etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2
Benjamin Teke created YARN-11687: Summary: Update CGroupsResourceCalculator to track usages using cgroupv2 Key: YARN-11687 URL: https://issues.apache.org/jira/browse/YARN-11687 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke [CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java] should also be updated to handle the cgroup v2 changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11679) Update GpuResourceHandler for cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838163#comment-17838163 ] Benjamin Teke commented on YARN-11679: -- GPU support is a bit tricky: cgroup v2 has no interface files for the device controller; it's now implemented on top of BPF. From the [docs|https://docs.kernel.org/admin-guide/cgroup-v2.html]: {quote}Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a device file, corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM. A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx structure, which describes the device access attempt: access type (mknod/read/write) and device (type, major and minor numbers). If the program returns 0, the attempt fails with -EPERM, otherwise it succeeds. An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree. {quote} > Update GpuResourceHandler for cgroup v2 support > --- > > Key: YARN-11679 > URL: https://issues.apache.org/jira/browse/YARN-11679 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Priority: Major > > cgroup v2 has some changes in various controllers (some changed their > functionality, some were removed). This task is about checking if > GpuResourceHandler's > [implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45] > needs any updates. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11685) Create a config to enable/disable cgroup v2 functionality
Benjamin Teke created YARN-11685: Summary: Create a config to enable/disable cgroup v2 functionality Key: YARN-11685 URL: https://issues.apache.org/jira/browse/YARN-11685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Various OSes mount cgroup v2 differently: some mount both the v1 and v2 structures, others mount a hybrid structure. To avoid initialization issues, the cgroup v1/v2 functionality should be selected by a config property. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
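If such a switch is added, it would presumably live in yarn-site.xml alongside the other NodeManager cgroup settings. The property name below is purely hypothetical — the issue only proposes that a toggle should exist, not what it is called:

```xml
<!-- Hypothetical toggle; the actual property name is not decided in this issue. -->
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.v2.enabled</name>
  <value>true</value>
</property>
```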
[jira] [Created] (YARN-11681) Update the cgroup documentation with v2 support
Benjamin Teke created YARN-11681: Summary: Update the cgroup documentation with v2 support Key: YARN-11681 URL: https://issues.apache.org/jira/browse/YARN-11681 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Update the related [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html] with v2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11680) Update FpgaResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11680: Summary: Update FpgaResourceHandler for cgroup v2 support Key: YARN-11680 URL: https://issues.apache.org/jira/browse/YARN-11680 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if FpgaResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/fpga/FpgaResourceHandlerImpl.java#L55] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11679) Update GpuResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11679: Summary: Update GpuResourceHandler for cgroup v2 support Key: YARN-11679 URL: https://issues.apache.org/jira/browse/YARN-11679 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if GpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11678) Update CGroupElasticMemoryController for cgroup v2 support
Benjamin Teke created YARN-11678: Summary: Update CGroupElasticMemoryController for cgroup v2 support Key: YARN-11678 URL: https://issues.apache.org/jira/browse/YARN-11678 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupElasticMemoryController's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupElasticMemoryController.java#L58] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11677) Update OutboundBandwidthResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11677: Summary: Update OutboundBandwidthResourceHandler implementation for cgroup v2 support Key: YARN-11677 URL: https://issues.apache.org/jira/browse/YARN-11677 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if OutboundBandwidthResourceHandler's [implementation|https://github.com/apache/hadoop/blob/2064ca015d1584263aac0cc20c60b925a3aff612/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficControlBandwidthHandlerImpl.java#L43] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11676) Update CGroupsBlkioResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11676: Summary: Update CGroupsBlkioResourceHandler implementation for cgroup v2 support Key: YARN-11676 URL: https://issues.apache.org/jira/browse/YARN-11676 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupsBlkioResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsBlkioResourceHandlerImpl.java#L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11675: Summary: Update MemoryResourceHandler implementation for cgroup v2 support Key: YARN-11675 URL: https://issues.apache.org/jira/browse/YARN-11675 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if MemoryResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11674) Update CpuResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11674: Summary: Update CpuResourceHandler implementation for cgroup v2 support Key: YARN-11674 URL: https://issues.apache.org/jira/browse/YARN-11674 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsCpuResourceHandlerImpl.java#L60] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11673) Extend the cgroup mount functionality to mount the v2 structure
Benjamin Teke created YARN-11673: Summary: Extend the cgroup mount functionality to mount the v2 structure Key: YARN-11673 URL: https://issues.apache.org/jira/browse/YARN-11673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke YARN has a --mount-cgroup operation in the [container-executor|https://github.com/apache/hadoop/blob/9c7b8cf54ea88833d54fc71a9612c448dc0eb78d/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L2929] which mounts each controller's cgroup folder to a specified path. In cgroup v2 the controller structure changed: it is now flat, so there are no separate per-controller paths. To remain compatible with v1, a new mount method should be added, but its functionality can be simplified considerably for v2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11672) Create a CgroupHandler implementation for cgroup v2
Benjamin Teke created YARN-11672: Summary: Create a CgroupHandler implementation for cgroup v2 Key: YARN-11672 URL: https://issues.apache.org/jira/browse/YARN-11672 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Assignee: Benjamin Teke [CGroupsHandler's|https://github.com/apache/hadoop/blob/69b328943edf2f61c8fc139934420e3f10bf3813/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandler.java#L36] current implementation contains the functionality to mount and set up the YARN-specific cgroup v1 structure. A similar v2 implementation should be created that allows initialising the v2 structure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Summary: [Umbrella] cgroup v2 support (was: cgroups v2 support for YARN) > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > > cgroup v2 is becoming the default for OSes such as RHEL9. > Support in YARN has to be implemented. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11669: - Description: cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 already moved to cgroup v2 as a default, hence YARN should support it. This umbrella tracks the required work. [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] was: The cgroups v2 is becoming the default for OSs, like RHEL9. Support for YARN has to be implemented. > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > > cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 > already moved to cgroup v2 as a default, hence YARN should support it. This > umbrella tracks the required work. > [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III
[ https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-5305. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Yarn Application Log Aggregation fails due to NM can not get correct HDFS > delegation token III > -- > > Key: YARN-5305 > URL: https://issues.apache.org/jira/browse/YARN-5305 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Xianyin Xin >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Different from YARN-5098 and YARN-5302, this problem happens when the AM submits > a startContainer request with a new HDFS token (say, tokenB) which is not > managed by YARN, so two tokens exist in the credentials of the user on the NM: > one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is > selected when connecting to HDFS and tokenB expires, an exception occurs. > Supplementary: this problem happens because the AM didn't use the service name > as the token alias in the credentials, so two tokens for the same service can > co-exist in one credentials object. The TokenSelector can only select the first matching > token; it doesn't check whether the token is valid. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10889. -- Fix Version/s: 3.4.0 Target Version/s: 3.4.0 Resolution: Fixed > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10889: - Fix Version/s: 3.5.0 > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11041) Replace all occurences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-10889: Assignee: Benjamin Teke (was: Szilard Nemeth) > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > > Follow-up of YARN-10496 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10921) AbstractCSQueue: Node Labels logic is scattered and iteration logic is repeated all over the place
[ https://issues.apache.org/jira/browse/YARN-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10921: - Parent Issue: YARN-11652 (was: YARN-10889) > AbstractCSQueue: Node Labels logic is scattered and iteration logic is > repeated all over the place > -- > > Key: YARN-10921 > URL: https://issues.apache.org/jira/browse/YARN-10921 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Assignee: Peter Szucs >Priority: Minor > > TODO items: > - Check original Node labels epic / jiras? > - Think about ways to improve repetitive iteration on configuredNodeLabels > - Search for: "String label" in code > Code blocks to handle Node labels: > - AbstractCSQueue#setupQueueConfigs > - AbstractCSQueue#getQueueConfigurations > - AbstractCSQueue#accessibleToPartition > - AbstractCSQueue#getNodeLabelsForQueue > - AbstractCSQueue#updateAbsoluteCapacities > - AbstractCSQueue#updateConfigurableResourceRequirement > - CSQueueUtils#loadCapacitiesByLabelsFromConf > - AutoCreatedLeafQueue -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10888. -- Resolution: Fixed > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if every place where a capacity can be defined accepted the > same formats: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (e.g. 10% of all memory, vcore, GPU) > ** Percentage per resource type (e.g. 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see whether it is possible (and sensible) to extend the > configuration in each case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11652) [Umbrella] Follow-up after YARN-10888/YARN-10889
Benjamin Teke created YARN-11652: Summary: [Umbrella] Follow-up after YARN-10888/YARN-10889 Key: YARN-11652 URL: https://issues.apache.org/jira/browse/YARN-11652 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.5.0 Reporter: Benjamin Teke Assignee: Benjamin Teke Follow-up improvements after the changes in YARN-10888/YARN-10889. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10886) Cluster based and parent based max capacity in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10886: - Parent Issue: YARN-11652 (was: YARN-10888) > Cluster based and parent based max capacity in Capacity Scheduler > - > > Key: YARN-10886 > URL: https://issues.apache.org/jira/browse/YARN-10886 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Szilard Nemeth >Priority: Major > > We want to introduce the percentage modes relative to the cluster, not the > parent, i.e > The property root.users.maximum-capacity will mean one of the following > things: > *Either Parent Percentage:* maximum capacity relative to its parent. If it’s > set to 50, then it means that the capacity is capped with respect to the > parent. This can be covered by the current format, no change there. > *Or Cluster Percentage:* maximum capacity expressed as a percentage of the > overall cluster capacity. This case is the new scenario, for example: > yarn.scheduler.capacity.root.users.max-capacity = c:50% > yarn.scheduler.capacity.root.users.max-capacity = c:50%, c:30% -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-10888: Assignee: Benjamin Teke (was: Szilard Nemeth) > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if every place where a capacity can be defined accepted the > same formats: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (e.g. 10% of all memory, vcore, GPU) > ** Percentage per resource type (e.g. 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see whether it is possible (and sensible) to extend the > configuration in each case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11041: -- Reopening because of a compilation failure in the original PR. > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message was sent by
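The `createChild` factory method suggested in the review thread above can be sketched roughly as follows. This is a simplified stand-in to illustrate the idea, not the actual Hadoop `QueuePath` class, whose real API may differ:

```java
// Simplified stand-in for the QueuePath idea discussed above: a path knows
// its full dotted string form, and createChild derives a child path from it.
public class QueuePath {
    private final String path; // e.g. "root" or "root.users"

    public QueuePath(String path) {
        this.path = path;
    }

    // Factory method from the review discussion: returns a new QueuePath
    // whose full path is <this>.<childName>, with this path as the parent.
    public QueuePath createChild(String childName) {
        return new QueuePath(path + "." + childName);
    }

    public String getFullPath() {
        return path;
    }

    // Parent is everything before the last dot; null for the root path.
    public String getParent() {
        int idx = path.lastIndexOf('.');
        return idx < 0 ? null : path.substring(0, idx);
    }
}
```

With such a factory, callers build `root.childName` paths without repeating the string concatenation ("string magic") that the reviewers objected to.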
[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10888: - Fix Version/s: 3.4.0 > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if everywhere where a capacity can be defined could be > defined the same way: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (eg 10% of all memory, vcore, GPU) > ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in that case.
[jira] [Resolved] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765922133] > Will fix the nit upon commit if I'm fine with the whole patch. Thanks for > noticing.| > > -- This message
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Fix Version/s: 3.5.0 (was: 3.4.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
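The flakiness described above comes from asserting on JSON responses in which the order of child queues is not deterministic. One general way to make such assertions stable — shown here as a stdlib-only sketch, not the fix actually applied in the Hadoop tests — is to normalize the queue order before comparing:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class QueueOrderNormalizer {
    // Sorts a list of queue objects (e.g. parsed from a scheduler JSON
    // response) by a hypothetical "queueName" field, so equality assertions
    // no longer depend on the order in which queues were emitted.
    public static List<Map<String, String>> sortByQueueName(List<Map<String, String>> queues) {
        List<Map<String, String>> copy = new ArrayList<>(queues);
        copy.sort(Comparator.comparing(q -> q.get("queueName")));
        return copy;
    }
}
```

Normalizing both the expected and the actual structure this way (or using an order-lenient JSON comparison mode) removes the dependency on iteration order that makes such tests flaky.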
[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11639: - Affects Version/s: 3.5.0 (was: 3.4.0) > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.4, 3.3.6, 3.5.0 >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at >
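The `ConcurrentModificationException` above is the classic symptom of the stream's source collection being structurally modified — here, a dynamic queue being removed by the deletion policy — while `getAssignmentIterator` is still iterating over it. A minimal stdlib-only reproduction of that failure mode, unrelated to the scheduler code itself:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

public class CmeDemo {
    // Returns true if iterating a list with a stream while removing one of
    // its elements (as a concurrent queue deletion would) throws
    // ConcurrentModificationException from the fail-fast ArrayList spliterator.
    public static boolean mutationDuringStreamThrows() {
        List<String> queues = new ArrayList<>(List.of("root.a", "root.b", "root.c"));
        try {
            queues.stream()
                  .map(q -> {
                      if ("root.a".equals(q)) {
                          queues.remove("root.c"); // structural modification mid-iteration
                      }
                      return q;
                  })
                  .collect(Collectors.toList());
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // the spliterator detects the modCount change
        }
    }
}
```

The usual remedies are to snapshot the child-queue list under the appropriate lock before streaming over it, or to otherwise guarantee that sorting and deletion cannot interleave.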
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Affects Version/s: 3.5.0 (was: 3.4.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.5.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
[jira] [Comment Edited] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810368#comment-17810368 ] Benjamin Teke edited comment on YARN-11639 at 1/24/24 12:48 PM: [~bender] Thanks for checking. No, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. was (Author: bteke): [~bender] no, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at >
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810368#comment-17810368 ] Benjamin Teke commented on YARN-11639: -- [~bender] no, simply just create the backport PRs under this jira, they'll be automatically added as links to this one. > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at
[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11639: - Affects Version/s: 3.3.6 3.2.4 3.4.0 > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6 >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
[jira] [Updated] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11645: - Affects Version/s: 3.4.0 (was: 3.5.0) > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
[jira] [Resolved] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11645. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.5.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
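The queue-order flakiness described in YARN-11645 can generally be removed by asserting against a canonical ordering instead of the emission order. A minimal illustrative sketch in plain Java (the committed test fix may differ; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: normalise queue lists into a canonical (lexicographic)
// order before comparing, so a scheduler that enumerates children in a
// different order no longer fails an order-sensitive assertion.
public class QueueOrderNormalizer {
    static List<String> canonical(List<String> queues) {
        List<String> sorted = new ArrayList<>(queues);
        sorted.sort(String::compareTo); // canonical order: lexicographic
        return sorted;
    }

    public static void main(String[] args) {
        List<String> a = canonical(List.of("root.b", "root.a"));
        List<String> b = canonical(List.of("root.a", "root.b"));
        System.out.println(a.equals(b)); // prints true
    }
}
```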
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809577#comment-17809577 ] Benjamin Teke commented on YARN-11639: -- Thanks [~bender] for the patch. Can you please check if branch-3.3/3.2 backport is needed? > ConcurrentModificationException and NPE in > PriorityUtilizationQueueOrderingPolicy > - > > Key: YARN-11639 > URL: https://issues.apache.org/jira/browse/YARN-11639 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > > When dynamic queue creation is enabled in weight mode and the deletion policy > coincides with the PriorityQueueResourcesForSorting, RM stops assigning > resources because of either ConcurrentModificationException or NPE in > PriorityUtilizationQueueOrderingPolicy. > Reproduced the NPE issue in Java8 and Java11 environment: > {code:java} > ... INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Removing queue: root.dyn.PmvkMgrEBQppu > 2024-01-02 17:00:59,399 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[Thread-11,5,main] threw an Exception. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.(PriorityUtilizationQueueOrderingPolicy.java:225) > at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) > {code} > Observed the ConcurrentModificationException in Java8 environment, but could > not reproduce yet: > {code:java} > 2023-10-27 02:50:37,584 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread > Thread[Thread-15,5, main] threw an Exception. > java.util.ConcurrentModificationException > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at
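The race in the YARN-11639 traces above — an async scheduler thread streaming over the child-queue collection while the deletion policy removes a dynamic queue — is the classic setup for a ConcurrentModificationException. One common mitigation is to sort a snapshot of the collection rather than the live list. The sketch below is illustrative plain Java under that assumption (class and field names are hypothetical, not the actual CapacityScheduler code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: copy the child-queue list under a read lock, then
// sort the copy, so a concurrent queue removal cannot invalidate the
// iteration mid-sort.
public class QueueSorter {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> childQueues = new ArrayList<>();

    public void addQueue(String name) {
        lock.writeLock().lock();
        try {
            childQueues.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void removeQueue(String name) {
        lock.writeLock().lock();
        try {
            childQueues.remove(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // The sort never touches the live list, so no
    // ConcurrentModificationException is possible here.
    public List<String> sortedSnapshot() {
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(childQueues);
        } finally {
            lock.readLock().unlock();
        }
        snapshot.sort(Comparator.naturalOrder());
        return snapshot;
    }

    public static void main(String[] args) {
        QueueSorter sorter = new QueueSorter();
        sorter.addQueue("root.dyn.b");
        sorter.addQueue("root.dyn.a");
        System.out.println(sorter.sortedSnapshot()); // prints [root.dyn.a, root.dyn.b]
    }
}
```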
[jira] [Resolved] (YARN-11634) Speed-up TestTimelineClient
[ https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11634. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Speed-up TestTimelineClient > --- > > Key: YARN-11634 > URL: https://issues.apache.org/jira/browse/YARN-11634 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > The TimelineConnector.class has a hardcoded 1-minute connection time out, > which makes the TestTimelineClient a long-running test (~15:30 min). > Decreasing the timeout to 10ms will speed up the test run (~56 sec). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11630) Passing admin Java options to container localizers
[ https://issues.apache.org/jira/browse/YARN-11630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11630. -- Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Passing admin Java options to container localizers > -- > > Key: YARN-11630 > URL: https://issues.apache.org/jira/browse/YARN-11630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > Currently we can specify Java options for container localizers in > _"yarn.nodemanager.container-localizer.java.opts"_ parameter. > The aim of this ticket is to create a parameter which we can use to pass > admin options as well. It would work similarly as the admin Java options we > can pass for Mapreduce jobs, first we should pass the admin options to the > container executor, then the user-defined ones. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
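The ordering described in YARN-11630 — admin options passed to the container executor first, then user-defined ones — can be sketched with a trivial merge helper. This is an illustrative stand-in (the helper name and splitting logic are hypothetical, not the actual NodeManager code); placing admin options first means later user flags can override earlier admin defaults where the JVM permits it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: build the localizer JVM argument list with admin
// options first, user options second, mirroring the MapReduce admin-opts
// behaviour described in the issue.
public class LocalizerOpts {
    static List<String> mergeJavaOpts(String adminOpts, String userOpts) {
        List<String> cmd = new ArrayList<>();
        cmd.addAll(Arrays.asList(adminOpts.trim().split("\\s+")));
        cmd.addAll(Arrays.asList(userOpts.trim().split("\\s+")));
        return cmd;
    }

    public static void main(String[] args) {
        // The user's -Xmx appears after the admin default, so it wins.
        System.out.println(mergeJavaOpts("-Xmx256m", "-Xmx512m"));
        // prints [-Xmx256m, -Xmx512m]
    }
}
```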
[jira] [Resolved] (YARN-11621) Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal
[ https://issues.apache.org/jira/browse/YARN-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11621. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix intermittently failing unit test: TestAMRMProxy.testAMRMProxyTokenRenewal > - > > Key: YARN-11621 > URL: https://issues.apache.org/jira/browse/YARN-11621 > Project: Hadoop YARN > Issue Type: Test > Components: yarn >Affects Versions: 3.3.6 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > This test seems to be flaky as it failed 3 times out of 200 runs based on the > trunk. > This was fixed earlier with YARN-7020, but it seems it didn't cover all the > flakiness. > h3. > {code:java} > Error Message > Application attempt appattempt_1630750910491_0001_01 doesn't exist in > ApplicationMasterService cache. at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at > 
java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894) > Stacktrace > org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: > Application attempt appattempt_1630750910491_0001_01 doesn't exist in > ApplicationMasterService cache. at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:407) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor$3.allocate(DefaultRequestInterceptor.java:224) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.DefaultRequestInterceptor.allocate(DefaultRequestInterceptor.java:135) > at > org.apache.hadoop.yarn.server.nodemanager.amrmproxy.AMRMProxyService.allocate(AMRMProxyService.java:329) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894) at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at >
[jira] [Created] (YARN-11608) QueueCapacityVectorInfo NPE when accessible labels config is used
Benjamin Teke created YARN-11608: Summary: QueueCapacityVectorInfo NPE when accessible labels config is used Key: YARN-11608 URL: https://issues.apache.org/jira/browse/YARN-11608 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke YARN-11514 extended the REST API to contain CapacityVectors for each configured node label. There is an edge case, however: during initialization each queue's capacities map will be filled with 0 capacities for the unconfigured, but accessible labels (where there is no configured capacity for the label, however the queue has access to it based on the accessible-node-labels property). A very basic example configuration for this is the following: {code:java} "yarn.scheduler.capacity.root.queues": "a, b" "yarn.scheduler.capacity.root.a.capacity": "50" "yarn.scheduler.capacity.root.a.accessible-node-labels": "root-a-default-label" "yarn.scheduler.capacity.root.a.maximum-capacity": "50" "yarn.scheduler.capacity.root.b.capacity": "50" {code} root.a has access to root-a-default-label, however there is no configured capacity for it. The capacityVectors are parsed based on the configuredCapacity map (created from the "accessible-node-labels..capacity" configs). When the scheduler info is requested the capacityVectors are collected per label, and the labels used for this are the keySet of the capacity map: {code:java} for (String partitionName : capacities.getExistingNodeLabels()) { QueueCapacityVector queueCapacityVector = queue.getConfiguredCapacityVector(partitionName); queueCapacityVectorInfo = queueCapacityVector == null ?
new QueueCapacityVectorInfo(new QueueCapacityVector()) : new QueueCapacityVectorInfo(queue.getConfiguredCapacityVector(partitionName)); {code} {code:java} public Set getExistingNodeLabels() { readLock.lock(); try { return new HashSet(capacitiesMap.keySet()); } finally { readLock.unlock(); } } {code} If the capacitiesMap contains entries that are not "configured", this will result in an NPE, breaking the UI and the REST API: {code:java} INTERNAL_SERVER_ERROR java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.QueueCapacityVectorInfo.(QueueCapacityVectorInfo.java:39) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.QueueCapacitiesInfo.(QueueCapacitiesInfo.java:61) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerLeafQueueInfo.populateQueueCapacities(CapacitySchedulerLeafQueueInfo.java:108) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerQueueInfo.(CapacitySchedulerQueueInfo.java:137) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerLeafQueueInfo.(CapacitySchedulerLeafQueueInfo.java:66) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.getQueues(CapacitySchedulerInfo.java:197) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.CapacitySchedulerInfo.(CapacitySchedulerInfo.java:94) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:399) {code} There is no need to create capacityVectors for the unconfigured labels, so a null check should solve this issue on the API side. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
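The null check proposed at the end of YARN-11608 can be illustrated with a minimal stand-in (names and the string-valued vector are hypothetical; the real fix touches the YARN webapp DAO classes): an unconfigured-but-accessible label resolves to an empty vector instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the API-side null check: labels that exist in the
// capacities map but have no configured capacity vector fall back to an
// empty vector rather than causing an NPE.
public class CapacityVectorLookup {
    static final String EMPTY_VECTOR = "[]";
    private final Map<String, String> configuredVectors = new HashMap<>();

    void configure(String label, String vector) {
        configuredVectors.put(label, vector);
    }

    String vectorFor(String label) {
        String v = configuredVectors.get(label);
        // Null check: unconfigured labels get the empty vector.
        return v == null ? EMPTY_VECTOR : v;
    }

    public static void main(String[] args) {
        CapacityVectorLookup lookup = new CapacityVectorLookup();
        lookup.configure("default", "[memory-mb=50%,vcores=50%]");
        // Accessible but unconfigured label: safe fallback instead of NPE.
        System.out.println(lookup.vectorFor("root-a-default-label")); // prints []
        System.out.println(lookup.vectorFor("default")); // prints [memory-mb=50%,vcores=50%]
    }
}
```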
[jira] [Updated] (YARN-11584) [CS] Attempting to create Leaf Queue with empty shortname should fail without crashing RM
[ https://issues.apache.org/jira/browse/YARN-11584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11584: - Fix Version/s: 3.4.0 > [CS] Attempting to create Leaf Queue with empty shortname should fail without > crashing RM > - > > Key: YARN-11584 > URL: https://issues.apache.org/jira/browse/YARN-11584 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Brian Goerlitz >Assignee: Brian Goerlitz >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > If an app submission results in attempting to auto-create a leaf queue with > an empty short name, the app submission should be rejected without the RM > crashing. Currently, the queue will be created, but the RM encounters a FATAL > exception due to metrics collision. > For example, if an app is placed to 'root.' the RM will fail with the below. > {noformat} > 2023-09-12 20:23:43,294 FATAL org.apache.hadoop.yarn.event.EventDispatcher: > Error in handling event type APP_ADDED to the Event Dispatcher > org.apache.hadoop.metrics2.MetricsException: Metrics source > QueueMetrics,q0=root already exists! 
> at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics.forQueue(CSQueueMetrics.java:309) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.(AbstractCSQueue.java:147) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue.(AbstractLeafQueue.java:148) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.(LeafQueue.java:42) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.createNewQueue(ParentQueue.java:495) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.addDynamicChildQueue(ParentQueue.java:563) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.addDynamicLeafQueue(ParentQueue.java:517) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.createAutoQueue(CapacitySchedulerQueueManager.java:678) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.createQueue(CapacitySchedulerQueueManager.java:511) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getOrCreateQueueFromPlacementContext(CapacityScheduler.java:898) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplication(CapacityScheduler.java:962) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1920) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:170) > at > 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
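The fix direction implied by YARN-11584 — reject the submission before any metrics source is registered — amounts to validating the auto-created queue's short name up front. A hedged sketch (illustrative, not the committed patch; class and method names are hypothetical):

```java
// Hypothetical sketch: derive the leaf queue's short name from its full path
// and fail fast with a checked error when it is empty, so placement to a
// path like "root." is rejected cleanly instead of crashing the RM with a
// QueueMetrics collision.
public class QueuePathValidator {
    static String shortName(String fullPath) {
        int idx = fullPath.lastIndexOf('.');
        return idx < 0 ? fullPath : fullPath.substring(idx + 1);
    }

    static void validateLeafQueuePath(String fullPath) {
        if (shortName(fullPath).isEmpty()) {
            throw new IllegalArgumentException(
                "Leaf queue short name must not be empty: '" + fullPath + "'");
        }
    }

    public static void main(String[] args) {
        validateLeafQueuePath("root.a"); // fine
        try {
            validateLeafQueuePath("root."); // empty short name -> rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```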
[jira] [Updated] (YARN-11578) Fix performance issue of permission check in verifyAndCreateRemoteLogDir
[ https://issues.apache.org/jira/browse/YARN-11578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11578: - Fix Version/s: 3.4.0 > Fix performance issue of permission check in verifyAndCreateRemoteLogDir > > > Key: YARN-11578 > URL: https://issues.apache.org/jira/browse/YARN-11578 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-10901 introduced a check to avoid a warn message in NN logs in certain > situations (when /tmp/logs is not owned by the yarn user), but it adds 3 > NameNode calls (create, setpermission, delete) during log aggregation > collection, for *every* NM. Meaning, when a YARN job completes, at the YARN > log aggregation phase this check is done for every job, from every > NodeManager. > In 30 minutes 4.2 % of all the NameNode calls were due to this in a cluster. > "write" calls need a Namesystem writeLock as well, so the impact is bigger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11578) Fix performance issue of permission check in verifyAndCreateRemoteLogDir
[ https://issues.apache.org/jira/browse/YARN-11578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11578. -- Hadoop Flags: Reviewed Resolution: Fixed > Fix performance issue of permission check in verifyAndCreateRemoteLogDir > > > Key: YARN-11578 > URL: https://issues.apache.org/jira/browse/YARN-11578 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-10901 introduced a check to avoid a warn message in NN logs in certain > situations (when /tmp/logs is not owned by the yarn user), but it adds 3 > NameNode calls (create, setpermission, delete) during log aggregation > collection, for *every* NM. Meaning, when a YARN job completes, at the YARN > log aggregation phase this check is done for every job, from every > NodeManager. > In 30 minutes 4.2 % of all the NameNode calls were due to this in a cluster. > "write" calls need a Namesystem writeLock as well, so the impact is bigger. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
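The cost pattern in YARN-11578 — three write-path NameNode RPCs (create, setPermission, delete) per NodeManager just to probe the remote log dir — suggests replacing the probe with a single metadata read and only taking the expensive path when it is actually needed. The sketch below is a hedged illustration of that idea with a plain in-memory owner map (it is not the actual Hadoop patch, and the method names are hypothetical):

```java
import java.util.Map;

// Hypothetical sketch: one cheap metadata lookup decides whether the costly
// create/setPermission/delete verification is needed at all.
public class RemoteLogDirCheck {
    // Stand-in for a single getFileStatus-style read (1 RPC instead of 3).
    static boolean needsPermissionFix(Map<String, String> dirOwner,
                                      String dir, String expectedOwner) {
        String owner = dirOwner.get(dir);
        // Only when the owner differs (or the dir is missing) do we need
        // the expensive verification/repair path.
        return owner == null || !owner.equals(expectedOwner);
    }

    public static void main(String[] args) {
        Map<String, String> owners = Map.of("/tmp/logs", "yarn");
        System.out.println(needsPermissionFix(owners, "/tmp/logs", "yarn")); // prints false
        System.out.println(needsPermissionFix(owners, "/tmp/logs", "hdfs")); // prints true
    }
}
```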
[jira] [Updated] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11567: - Fix Version/s: 3.4.0 > Aggregate container launch debug artifacts automatically in case of error > - > > Key: YARN-11567 > URL: https://issues.apache.org/jira/browse/YARN-11567 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > In cases where a container fails to launch without writing to a log file, we > would often want to see the artifacts captured by > {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better > understand the cause of the exit code. Enabling this feature for every > container may be overkill, so we need a feature flag to capture these > artifacts in case of errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759092#comment-17759092 ] Benjamin Teke edited comment on YARN-11514 at 8/25/23 3:53 PM: --- After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which has more risk than benefits especially because we're adding a simple map containing 2-4 elements. And since this is a public API we risk breaking other dependent components. So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} was (Author: bteke): After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which isn't necessarily worth it. And since this is a public API we risk breaking other dependent components. 
So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). > - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients needs to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This 
message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759092#comment-17759092 ] Benjamin Teke commented on YARN-11514: -- After some trials I suggest moving forward with the current jsonProvider. The ideal solution would be to replace Jettison with Jackson, as it would make adding new/custom fields way easier, but achieving exactly the same result would mean a lot of code change and even custom solutions, which isn't necessarily worth it. And since this is a public API we risk breaking other dependent components. So the current implementation looks like this: {code:java} "queueCapacityVectorInfo" : { "configuredCapacityVector" : "[memory-mb=12.5%,vcores=12.5%]", "capacityVectorEntries" : [ { "resourceName" : "memory-mb", "resourceValue" : "12.5%" }, { "resourceName" : "vcores", "resourceValue" : "12.5%" } ] }, {code} > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). 
> - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients need to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
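Since the implementation chosen in YARN-11514 exposes the vector both as a flat bracketed string (configuredCapacityVector) and as an explicit entry list, client-side consumption of the string form stays trivial. An illustrative parser sketch (plain Java, not part of the YARN codebase; it assumes the "[name=value,...]" shape shown in the comment above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parse "[memory-mb=12.5%,vcores=12.5%]" back into an
// ordered map, showing that the flat string format is easy for clients to
// consume even without the Jettison-to-Jackson rewrite.
public class CapacityVectorParser {
    static Map<String, String> parse(String vector) {
        Map<String, String> entries = new LinkedHashMap<>();
        // Strip the surrounding brackets, then split "name=value" pairs.
        String body = vector.substring(1, vector.length() - 1);
        if (body.isEmpty()) {
            return entries;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.split("=", 2);
            entries.put(kv[0], kv[1]);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(parse("[memory-mb=12.5%,vcores=12.5%]"));
        // prints {memory-mb=12.5%, vcores=12.5%}
    }
}
```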
[jira] [Resolved] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11535. -- Assignee: Benjamin Teke (was: Susheel Gupta) Resolution: Fixed > Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. To overcome this and since > jackson-dataformat-yaml is not actually used it should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Summary: Remove jackson-dataformat-yaml dependency (was: Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7) > Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Remove jackson-dataformat-yaml dependency
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Description: Hadoop-project uses [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] and [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. But internally jackson-dataformat-yaml-2.12.7 uses compile dependency [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] .This may cause a transitive dependency issue in other services using hadoop jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 will use nearest dependency available of snakeyaml i.e 1.27 and ignore the version of snakeyaml-2.0 from hadoop-project. To overcome this and since jackson-dataformat-yaml is not actually used it should be removed. was: Hadoop-project uses [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] and [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. But internally jackson-dataformat-yaml-2.12.7 uses compile dependency [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] .This may cause a transitive dependency issue in other services using hadoop jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 will use nearest dependency available of snakeyaml i.e 1.27 and ignore the version of snakeyaml-2.0 from hadoop-project. 
> Remove jackson-dataformat-yaml dependency > - > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. To overcome this and since > jackson-dataformat-yaml is not actually used it should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11535: - Attachment: deps.txt > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756905#comment-17756905 ] Benjamin Teke commented on YARN-11535: -- Since AFAIK the only reason jackson-dataformat-yaml was pulled in is to solve some dependency conflicts, a safer way to resolve this issue is to remove jackson-dataformat-yaml and exclude it in the one place it's transitively imported. Uploaded the new dependency:tree output after the change. I've created a PR for this: https://github.com/apache/hadoop/pull/5970/ Side-note: the only usage of snakeyaml is in Apache Slider (a long-retired project), to provide a possibility for writing its config to a YAML file, but without tests it's a larger effort to refactor it. Since its use case is a simple one, it's unlikely to break with upgrades. > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: deps.txt > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses the compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7]. > This may cause a transitive dependency issue in other services using the hadoop > jar that have jackson-dataformat-yaml-2.12.7, as jackson-dataformat-yaml-2.12.7 > will use the nearest available dependency of snakeyaml, i.e. 1.27, and ignore the > version of snakeyaml-2.0 from hadoop-project. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
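The exclusion approach described in the comment above would look roughly like the pom.xml fragment below. This is a sketch only: the module that transitively imports jackson-dataformat-yaml is not named in the thread, so `some-group:some-artifact` is a placeholder; see the linked PR for the actual change.

```xml
<!-- Hedged sketch: "some-group:some-artifact" stands in for whichever
     dependency transitively pulls jackson-dataformat-yaml. Excluding it
     here keeps snakeyaml 2.0 from hadoop-project as the resolved version. -->
<dependency>
  <groupId>some-group</groupId>
  <artifactId>some-artifact</artifactId>
  <exclusions>
    <exclusion>
      <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-yaml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After such a change, `mvn dependency:tree -Dincludes=org.yaml:snakeyaml` can confirm that only the intended snakeyaml version remains on the classpath.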
[jira] [Created] (YARN-11551) RM format-conf-store should delete all the content of ZKConfigStore
Benjamin Teke created YARN-11551: Summary: RM format-conf-store should delete all the content of ZKConfigStore Key: YARN-11551 URL: https://issues.apache.org/jira/browse/YARN-11551 Project: Hadoop YARN Issue Type: Bug Reporter: Benjamin Teke Assignee: Benjamin Teke To easily overcome the issue mentioned in YARN-11075, format-conf-store should delete everything under RM_SCHEDCONF_STORE_ZK_PARENT_PATH, not just the CONF_STORE, because LOGS can contain elements that cannot be deserialized because of a missing serialVersionUID. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11416) FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder
[ https://issues.apache.org/jira/browse/YARN-11416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11416. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > FS2CS should use CapacitySchedulerConfiguration in FSQueueConverterBuilder > --- > > Key: YARN-11416 > URL: https://issues.apache.org/jira/browse/YARN-11416 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.FSQueueConverter > and its builder stores the variable capacitySchedulerConfig as a simple > Configuration object instead of CapacitySchedulerConfiguration. This is > misleading, as capacitySchedulerConfig suggests that it is indeed a > CapacitySchedulerConfiguration and it loses access to the convenience methods > to check for various properties. Because of this every time a property getter > is changed FS2CS should be checked if it reimplemented the same, otherwise > there might be behaviour differences or even bugs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11535) Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause transitive dependency issue with 2.12.7
[ https://issues.apache.org/jira/browse/YARN-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11535. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Jackson-dataformat-yaml should be upgraded to 2.15.2 as it may cause > transitive dependency issue with 2.12.7 > > > Key: YARN-11535 > URL: https://issues.apache.org/jira/browse/YARN-11535 > Project: Hadoop YARN > Issue Type: Task > Components: build, yarn >Affects Versions: 3.4.0 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Hadoop-project uses > [snakeyaml.version-2.0|https://github.com/apache/hadoop/blame/trunk/hadoop-project/pom.xml#L198] > and > [jackson-dataformat-yaml-2.12.7|https://github.com/apache/hadoop/blob/trunk/hadoop-project/pom.xml#L72]. > But internally jackson-dataformat-yaml-2.12.7 uses compile dependency > [snakeyaml.version-1.27|https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-yaml/2.12.7] > .This may cause a transitive dependency issue in other services using hadoop > jar having jackson-dataformat-yaml-2.12.7 as jackson-dataformat-yaml-2.12.7 > will use nearest dependency available of snakeyaml i.e 1.27 and ignore the > version of snakeyaml-2.0 from hadoop-project. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11545) FS2CS not converts ACLs when all users are allowed
[ https://issues.apache.org/jira/browse/YARN-11545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11545. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > FS2CS not converts ACLs when all users are allowed > -- > > Key: YARN-11545 > URL: https://issues.apache.org/jira/browse/YARN-11545 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Currently we only convert ACLs if users or groups are set. This should be > extended to check if the "allAllowed" flag is set in the AccessControlList to > be able to preserve * values also for the ACLs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
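The extended check described in YARN-11545 can be sketched as follows. `AclSketch` is a minimal stand-in written for illustration; it is not Hadoop's AccessControlList or the actual FS2CS converter code, and the output format is assumed, not taken from the source.

```java
import java.util.List;

// Hedged sketch of the fixed conversion rule: check the all-allowed flag
// first, so a "*" (everyone) ACL is preserved instead of being dropped
// when no explicit users or groups are listed.
class AclSketch {
    final List<String> users;
    final List<String> groups;
    final boolean allAllowed;

    AclSketch(List<String> users, List<String> groups, boolean allAllowed) {
        this.users = users;
        this.groups = groups;
        this.allAllowed = allAllowed;
    }

    String toCapacitySchedulerAcl() {
        if (allAllowed) {
            return "*"; // the case the original conversion missed
        }
        // "users groups" layout is illustrative only
        return String.join(",", users) + " " + String.join(",", groups);
    }

    public static void main(String[] args) {
        System.out.println(new AclSketch(List.of(), List.of(), true).toCapacitySchedulerAcl());
        System.out.println(new AclSketch(List.of("alice"), List.of("admins"), false).toCapacitySchedulerAcl());
    }
}
```

Without the `allAllowed` branch, an ACL with empty user and group lists would convert to an empty string rather than the wildcard, which is the behaviour the ticket fixes.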
[jira] [Updated] (YARN-11543) Fix checkstyle issues after YARN-11520
[ https://issues.apache.org/jira/browse/YARN-11543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11543: - Summary: Fix checkstyle issues after YARN-11520 (was: Fix checkstyle after YARN-11520) > Fix checkstyle issues after YARN-11520 > -- > > Key: YARN-11543 > URL: https://issues.apache.org/jira/browse/YARN-11543 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > YARN-11520 missed some checkstyle fixes, they should be resolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11543) Fix checkstyle after YARN-11520
Benjamin Teke created YARN-11543: Summary: Fix checkstyle after YARN-11520 Key: YARN-11543 URL: https://issues.apache.org/jira/browse/YARN-11543 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Assignee: Benjamin Teke YARN-11520 missed some checkstyle fixes, they should be resolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11521) Create a test set that runs with both Legacy/Uniform queue calculation
[ https://issues.apache.org/jira/browse/YARN-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11521: - Fix Version/s: 3.4.0 > Create a test set that runs with both Legacy/Uniform queue calculation > -- > > Key: YARN-11521 > URL: https://issues.apache.org/jira/browse/YARN-11521 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Follow-up ticket for YARN-11000. > The JSON assert tests in TestRMWebServicesCapacitySchedDynamicConfig are a > good candidate for this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11534) Incorrect exception handling in RecoveredContainerLaunch
[ https://issues.apache.org/jira/browse/YARN-11534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11534. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Incorrect exception handling in RecoveredContainerLaunch > > > Key: YARN-11534 > URL: https://issues.apache.org/jira/browse/YARN-11534 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When NM is restarted during a container recovery, it can happen that it > interrupts the container reaquisition during the LinuxContainerExecutor's > signalContainer method. In this case we will get the following exception: > {code:java} > java.io.InterruptedIOException: java.lang.InterruptedException > at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011) > at org.apache.hadoop.util.Shell.run(Shell.java:901) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291) > at > 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > Caused by: java.lang.InterruptedException > at java.base/java.lang.Object.wait(Native Method) > at java.base/java.lang.Object.wait(Object.java:328) > at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001) > ... 15 more{code} > Later this InterruptedIOException gets caught and wrapped inside a > PrivilegedOperationException and a ContainerExecutionException. 
In > LinuxContainerExecutor's > [signalContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L790] > method we catch this exception again, and throw an IOException from it, > indicating this error message in the stack trace: > {code:java} > IOException from it, causing the following stack trace: > org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: > Signal container failed > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887) > at > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84) > at >
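Independent of the Hadoop specifics, the usual remedy for the pattern above, an InterruptedIOException being swallowed by generic IOException handling, is to detect the interrupt-flavoured exception and restore the thread's interrupt status before rethrowing. A minimal self-contained sketch; the method names echo the stack trace but the bodies are illustrative, not Hadoop's code:

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class InterruptDemo {
    // Simulates a shell call interrupted mid-wait, as Shell.runCommand is in
    // the stack trace above (this stand-in just throws directly).
    static void runCommand() throws InterruptedIOException {
        InterruptedIOException iioe =
            new InterruptedIOException("interrupted during waitFor");
        iioe.initCause(new InterruptedException());
        throw iioe;
    }

    // One conventional fix: re-set the thread's interrupt flag so callers
    // further up can distinguish an interrupt from an ordinary I/O failure.
    static void signalContainer() throws IOException {
        try {
            runCommand();
        } catch (InterruptedIOException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            throw e;                            // keep the original cause visible
        }
    }

    public static void main(String[] args) {
        try {
            signalContainer();
        } catch (IOException e) {
            System.out.println(Thread.currentThread().isInterrupted()
                ? "interrupt preserved" : "interrupt lost");
        }
    }
}
```

The key design point is that wrapping the exception twice (PrivilegedOperationException, then ContainerExecutionException) without touching the interrupt flag makes the restart-time interrupt indistinguishable from a genuine signalling failure.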
[jira] [Created] (YARN-11539) Flexible AQC: setting capacity with leaf-template doesn't work
Benjamin Teke created YARN-11539: Summary: Flexible AQC: setting capacity with leaf-template doesn't work Key: YARN-11539 URL: https://issues.apache.org/jira/browse/YARN-11539 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke In [AbstractCSQueue.setDynamicQueueProperties|https://github.com/apache/hadoop/blob/193ff1c24e55938f42cb8ca12da754a2636f62a7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L451] we're always calling [AutoCreatedQueueTemplate.setTemplateEntriesForChild|https://github.com/apache/hadoop/blob/bf570bd4acd9ebccada80a68aa1c5fbf73ca60bf/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java#L107] with isLeaf false, hence leaf templates like capacity won't be added to the dynamic queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
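The defect described in YARN-11539 reduces to a small sketch: an overload that hard-codes `isLeaf` to false never applies leaf-only template keys such as capacity. The class and method names below mirror the linked Hadoop sources, but the bodies are hypothetical, written only to illustrate the bug pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: not Hadoop's AutoCreatedQueueTemplate, just the shape of
// the bug. Leaf-template entries are silently dropped because the one-arg
// overload always delegates with isLeaf=false.
class QueueTemplateSketch {
    final Map<String, String> parentTemplate = new HashMap<>();
    final Map<String, String> leafTemplate = new HashMap<>();

    void setTemplateEntriesForChild(Map<String, String> target) {
        setTemplateEntriesForChild(target, false); // bug: isLeaf hard-coded
    }

    void setTemplateEntriesForChild(Map<String, String> target, boolean isLeaf) {
        target.putAll(parentTemplate);
        if (isLeaf) {
            target.putAll(leafTemplate); // leaf-only keys such as "capacity"
        }
    }

    public static void main(String[] args) {
        QueueTemplateSketch t = new QueueTemplateSketch();
        t.leafTemplate.put("capacity", "50%");
        Map<String, String> dynamicLeafQueue = new HashMap<>();
        t.setTemplateEntriesForChild(dynamicLeafQueue);
        // The leaf queue never receives its capacity template entry.
        System.out.println(dynamicLeafQueue.containsKey("capacity"));
    }
}
```

The fix implied by the ticket is for the caller in AbstractCSQueue to pass the queue's actual leaf status instead of a constant.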
[jira] [Assigned] (YARN-11514) Extend SchedulerResponse with capacityVector
[ https://issues.apache.org/jira/browse/YARN-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11514: Assignee: Benjamin Teke > Extend SchedulerResponse with capacityVector > > > Key: YARN-11514 > URL: https://issues.apache.org/jira/browse/YARN-11514 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > > The goal is to add the *capacityVector* to the Scheduler response (XML/JSON). > - CapacitySchedulerQueueInfo.java > - PartitionQueueCapacitiesInfo.java > The proposed format in the design doc (YARN-10888): > {code:json} > { >"capacityVector": { > "memory-mb": "30%", > "vcores": "16" >} > } > {code} > {code:xml} > > > 30% > 16 > > {code} > Unfortunately the current jsonProvider (MoxyJsonFeature or JettisonFeature > not sure) serialise map structures in the following way: > {code:json} > { > "capacityVector":{ > "entry":[ > { > "key":"memory-mb", > "value":"12288" > }, > { > "key":"vcores", > "value":"86%" > } > ] > } > } > {code} > {code:xml} > > > > memory-mb > 1288 > > > vcores > 12 > > > {code} > Based on some research with the following two dependencies we could achieve > the proposed format: > - jersey-media-json-jackson (this one is used in the apps catalog already) > - jackson-dataformat-xml > Some concerns: > - 2 more dependencies > - for the XML when the content depends on the runtime content of the map is > not XSD friendly > - name is capacityVector but it's represented in a map > An alternative could be to just store the capacityVector as a string, but > then clients needs to parse it, and it's not particularly nice either: > {code:json} > { >"capacityVector": "[\"memory-mb\": 12288, \"vcores\": 86%]" > } > {code} > {code:xml} > > [memory-mb: 12288, vcores: > 86%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11520) Support capacity vector for AQCv2 dynamic templates
[ https://issues.apache.org/jira/browse/YARN-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reassigned YARN-11520: Assignee: Benjamin Teke > Support capacity vector for AQCv2 dynamic templates > --- > > Key: YARN-11520 > URL: https://issues.apache.org/jira/browse/YARN-11520 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Benjamin Teke >Priority: Major > > AQCv2 dynamic queue templates should support the new capacity modes. > e.g.: > {code} > auto-queue-creation-v2.parent-template.capacity = [memory=12288, vcores=86%] > auto-queue-creation-v2.leaf-template.capacity = [memory=1w, vcores=1] > auto-queue-creation-v2.template.capacity = [memory=20%, vcores=50%] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
[ https://issues.apache.org/jira/browse/YARN-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11533: - Fix Version/s: 3.4.0 > CapacityScheduler CapacityConfigType changed in legacy queue allocation mode > > > Key: YARN-11533 > URL: https://issues.apache.org/jira/browse/YARN-11533 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > YARN-11000 changed the CapacityConfigType determination method in legacy > queue mode, which has some undesired effects, like marking root to be in > percentage mode, when the rest of the queue structure is in absolute. > Config: > {code:java} > > > yarn.scheduler.capacity.root.maximum-capacity > [memory=40960,vcores=32] > > > yarn.scheduler.capacity.root.queues > default > > > yarn.scheduler.capacity.root.capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.schedule-asynchronously.enable > true > > > > yarn.scheduler.capacity.root.default.maximum-capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.root.default.maximum-am-resource-percent > 0.2 > > > yarn.scheduler.capacity.root.default.capacity > [memory=40960,vcores=32] > > > {code} > Old response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": true, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "absolute", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > New 
response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": false, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "percentage", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > This is misleading and has some side-effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9877) Intermittent TIME_OUT of LogAggregationReport
[ https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-9877: Fix Version/s: 3.4.0 > Intermittent TIME_OUT of LogAggregationReport > - > > Key: YARN-9877 > URL: https://issues.apache.org/jira/browse/YARN-9877 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager, yarn >Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-9877.001.patch > > > I noticed some intermittent TIME_OUT in some downstream log-aggregation based > tests. > Steps to reproduce: > - Let's run a MR job > {code} > hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep > -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000 > {code} > - Suppose the AM is requesting more containers, but as soon as they're > allocated - the AM realizes it doesn't need them. The container's state > changes are: ALLOCATED -> ACQUIRED -> RELEASED. > Let's suppose these extra containers are allocated in a different node from > the other 21 (AM + 10 mapper + 10 reducer) containers' node. > - All the containers finish successfully and the app is finished successfully > as well. Log aggregation status for the whole app seemingly stucks in RUNNING > state. > - After a while the final log aggregation status for the app changes to > TIME_OUT. > Root cause: > - As unused containers are getting through the state transition in the RM's > internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s > transition function is called. This calls the > {{RMAppLogAggregation$addReportIfNecessary}} which forcefully adds the > "NOT_START" LogAggregationStatus associated with this NodeId for the app, > even though it does not have any running container on it. 
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the > NodeManager because it does not have any running container on it (Note that > the AM immediately released them after acquisition). The LogAggregationStatus > remains NOT_START until time out is reached. After that point the RM > aggregates the LogAggregationReports for all the nodes, and though all the > containers have SUCCEEDED state, one particular node has NOT_START, so the > final log aggregation will be TIME_OUT. > (I crawled the RM UI for the log aggregation statuses, and it was always > NOT_START for this particular node). > This situation is highly unlikely, but has an estimated ~0.8% of failure rate > based on a year's 1500 run on an unstressed cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
[ https://issues.apache.org/jira/browse/YARN-11533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11533: - Parent: YARN-10888 Issue Type: Sub-task (was: Bug) > CapacityScheduler CapacityConfigType changed in legacy queue allocation mode > > > Key: YARN-11533 > URL: https://issues.apache.org/jira/browse/YARN-11533 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > > YARN-11000 changed the CapacityConfigType determination method in legacy > queue mode, which has some undesired effects, like marking root to be in > percentage mode, when the rest of the queue structure is in absolute. > Config: > {code:java} > > > yarn.scheduler.capacity.root.maximum-capacity > [memory=40960,vcores=32] > > > yarn.scheduler.capacity.root.queues > default > > > yarn.scheduler.capacity.root.capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.schedule-asynchronously.enable > true > > > > yarn.scheduler.capacity.root.default.maximum-capacity > [memory=40960,vcores=32] > > > > yarn.scheduler.capacity.root.default.maximum-am-resource-percent > 0.2 > > > yarn.scheduler.capacity.root.default.capacity > [memory=40960,vcores=32] > > > {code} > Old response: > {code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": true, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "absolute", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > New response: > 
{code:java} > { > "scheduler": { > "schedulerInfo": { > "type": "capacityScheduler", > "capacity": 100.0, > "usedCapacity": 0.0, > "maxCapacity": 100.0, > "weight": -1.0, > "normalizedWeight": 0.0, > "queueName": "root", > "queuePath": "root", > "maxParallelApps": 2147483647, > "isAbsoluteResource": false, > "queues": {...} > "queuePriority": 0, > "orderingPolicyInfo": "utilization", > "mode": "percentage", > "queueType": "parent", > "creationMethod": "static", > "autoCreationEligibility": "off", > "autoQueueTemplateProperties": {}, > "autoQueueParentTemplateProperties": {}, > "autoQueueLeafTemplateProperties": {} > } > } > } > {code} > This is misleading and has some side-effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11533) CapacityScheduler CapacityConfigType changed in legacy queue allocation mode
Benjamin Teke created YARN-11533: Summary: CapacityScheduler CapacityConfigType changed in legacy queue allocation mode Key: YARN-11533 URL: https://issues.apache.org/jira/browse/YARN-11533 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.4.0 Reporter: Benjamin Teke Assignee: Benjamin Teke
YARN-11000 changed the CapacityConfigType determination method in legacy queue mode, which has some undesired effects, such as marking root as being in percentage mode while the rest of the queue structure is in absolute mode.
Config:
{code:xml}
<property>
  <name>yarn.scheduler.capacity.root.maximum-capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-am-resource-percent</name>
  <value>0.2</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>[memory=40960,vcores=32]</value>
</property>
{code}
Old response:
{code:java}
{
  "scheduler": {
    "schedulerInfo": {
      "type": "capacityScheduler",
      "capacity": 100.0,
      "usedCapacity": 0.0,
      "maxCapacity": 100.0,
      "weight": -1.0,
      "normalizedWeight": 0.0,
      "queueName": "root",
      "queuePath": "root",
      "maxParallelApps": 2147483647,
      "isAbsoluteResource": true,
      "queues": {...},
      "queuePriority": 0,
      "orderingPolicyInfo": "utilization",
      "mode": "absolute",
      "queueType": "parent",
      "creationMethod": "static",
      "autoCreationEligibility": "off",
      "autoQueueTemplateProperties": {},
      "autoQueueParentTemplateProperties": {},
      "autoQueueLeafTemplateProperties": {}
    }
  }
}
{code}
New response:
{code:java}
{
  "scheduler": {
    "schedulerInfo": {
      "type": "capacityScheduler",
      "capacity": 100.0,
      "usedCapacity": 0.0,
      "maxCapacity": 100.0,
      "weight": -1.0,
      "normalizedWeight": 0.0,
      "queueName": "root",
      "queuePath": "root",
      "maxParallelApps": 2147483647,
      "isAbsoluteResource": false,
      "queues": {...},
      "queuePriority": 0,
      "orderingPolicyInfo": "utilization",
      "mode": "percentage",
      "queueType": "parent",
      "creationMethod": "static",
      "autoCreationEligibility": "off",
      "autoQueueTemplateProperties": {},
      "autoQueueParentTemplateProperties": {},
      "autoQueueLeafTemplateProperties": {}
    }
  }
}
{code}
This is misleading and has some side effects on various clients. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
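The mode flip described above hinges on the syntax of the capacity value: absolute resources are written as a bracketed vector ([memory=40960,vcores=32]) while percentage mode uses a plain number. A minimal, hypothetical sketch of that distinction (class and method names are illustrative; the real determination lives in CapacityScheduler's configuration handling and is more involved):

```java
// Hedged sketch: distinguishing absolute-resource capacity values from
// percentage values by their syntax. Not the actual CapacityScheduler code.
public class CapacityModeSketch {
    static boolean isAbsoluteResource(String value) {
        String v = value == null ? "" : value.trim();
        // Absolute mode: bracketed resource vector, e.g. [memory=40960,vcores=32]
        return v.startsWith("[") && v.endsWith("]");
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteResource("[memory=40960,vcores=32]")); // true
        System.out.println(isAbsoluteResource("100"));                      // false
    }
}
```

Under this reading, the bug report amounts to root's value no longer being classified by this syntax check in legacy queue mode.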
[jira] [Resolved] (YARN-11464) TestFSQueueConverter#testAutoCreateV2FlagsInWeightMode has a missing dot before auto-queue-creation-v2.enabled for method call assertNoValueForQueues
[ https://issues.apache.org/jira/browse/YARN-11464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11464. -- Fix Version/s: 3.4.0 Resolution: Fixed
> TestFSQueueConverter#testAutoCreateV2FlagsInWeightMode has a missing dot
> before auto-queue-creation-v2.enabled for method call assertNoValueForQueues
>
> Key: YARN-11464
> URL: https://issues.apache.org/jira/browse/YARN-11464
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 3.3.4
> Reporter: Susheel Gupta
> Assignee: Susheel Gupta
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> This testcase clearly reproduces the issue. There is a missing dot before
> "auto-queue-creation-v2.enabled" for method call assertNoValueForQueues.
> {code:java}
> @Test
> public void testAutoCreateV2FlagsInWeightMode() {
>   converter = builder.withPercentages(false).build();
>   converter.convertQueueHierarchy(rootQueue);
>   assertTrue("root autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.admins autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.admins.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.users autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.users.auto-queue-creation-v2.enabled", false));
>   assertTrue("root.misc autocreate v2 flag",
>       csConfig.getBoolean(
>           PREFIX + "root.misc.auto-queue-creation-v2.enabled", false));
>   Set<String> leafs = Sets.difference(ALL_QUEUES,
>       Sets.newHashSet("root",
>           "root.default",
>           "root.admins",
>           "root.users",
>           "root.misc"));
>   assertNoValueForQueues(leafs, "auto-queue-creation-v2.enabled", csConfig);
> } {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
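The effect of the missing dot is easiest to see by building the lookup key the way the helper presumably does. In this sketch, PREFIX and the concatenation inside assertNoValueForQueues are assumptions inferred from the test shown above, not the actual converter test code:

```java
// Hedged sketch: how a missing leading dot in the property argument mangles
// the config key that assertNoValueForQueues presumably builds.
public class MissingDotSketch {
    // Assumed prefix, matching the capacity-scheduler keys in the quoted test
    static final String PREFIX = "yarn.scheduler.capacity.";

    // Assumed shape of the key built from a queue path and a property suffix
    static String key(String queue, String property) {
        return PREFIX + queue + property;
    }

    public static void main(String[] args) {
        // Without the leading dot the key never matches any real config entry,
        // so a "queue has no value" assertion passes vacuously:
        System.out.println(key("root.users", "auto-queue-creation-v2.enabled"));
        // With the dot it addresses the intended property:
        System.out.println(key("root.users", ".auto-queue-creation-v2.enabled"));
    }
}
```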
[jira] [Resolved] (YARN-11513) Applications submitted to ambiguous queue fail during recovery if "Specified" Placement Rule is used
[ https://issues.apache.org/jira/browse/YARN-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11513. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Applications submitted to ambiguous queue fail during recovery if "Specified" > Placement Rule is used > > > Key: YARN-11513 > URL: https://issues.apache.org/jira/browse/YARN-11513 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.4 >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When an app is submitted to an ambiguous queue using the full queue path and > is placed in that pool via a {{%specified}} mapping Placement Rule, the queue > in the stored ApplicationSubmissionContext will be the short name for the > queue. During recovery from an RM failover, the placement rule will be > evaluated using the stored short name of the queue, resulting in the RM > killing the app as it cannot resolve the ambiguous queue name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
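Why the persisted short name breaks recovery can be shown in a few lines: once two leaf queues share the same last path segment, the short form no longer identifies a unique queue. This is an illustrative toy, not the RM's actual queue resolution code:

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of ambiguous short-name resolution. Queue paths and the
// matching rule are illustrative only.
public class AmbiguousQueueSketch {
    static long countMatches(List<String> queuePaths, String shortName) {
        return queuePaths.stream()
            .filter(p -> p.endsWith("." + shortName))
            .count();
    }

    public static void main(String[] args) {
        // Two leaf queues sharing the short name "dev"
        List<String> queuePaths = Arrays.asList("root.a.dev", "root.b.dev");
        String stored = "dev"; // short name persisted in the ApplicationSubmissionContext

        // Two candidates -> resolution is ambiguous, so recovery cannot place the app
        System.out.println(countMatches(queuePaths, stored)); // 2
    }
}
```

Storing the full path ("root.a.dev") instead of the short name sidesteps the ambiguity, which is essentially what the fix needs to guarantee.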
[jira] [Resolved] (YARN-11498) Exclude Jettison from jersey-json artifact in hadoop-yarn-common's pom.xml
[ https://issues.apache.org/jira/browse/YARN-11498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11498. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Exclude Jettison from jersey-json artifact in hadoop-yarn-common's pom.xml > -- > > Key: YARN-11498 > URL: https://issues.apache.org/jira/browse/YARN-11498 > Project: Hadoop YARN > Issue Type: Task > Components: build >Reporter: Devaspati Krishnatri >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > This exclusion is done to reduce CVEs present due to an older version of > Jettison(1.1) being pulled in with jersey-json artifact. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
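For reference, an exclusion of this kind takes the usual Maven form. The following is a sketch of what the hadoop-yarn-common pom.xml change presumably looks like; the exact coordinates and surrounding context in the committed patch may differ:

```xml
<dependency>
  <groupId>com.sun.jersey</groupId>
  <artifactId>jersey-json</artifactId>
  <exclusions>
    <!-- Jettison 1.1 carries known CVEs; exclude the transitive copy -->
    <exclusion>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```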
[jira] [Resolved] (YARN-11511) Improve TestRMWebServices test config and data
[ https://issues.apache.org/jira/browse/YARN-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11511. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Improve TestRMWebServices test config and data > -- > > Key: YARN-11511 > URL: https://issues.apache.org/jira/browse/YARN-11511 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > We found multiple configuration issues with > *TestRMWebServicesCapacitySched.java* and > *TestRMWebServicesCapacitySchedDynamicConfig.java*. > h3. 1. createMockRM created the RM with a non-intuitive queue config > (createMockRM was used from the TestRMWebServicesCapacitySchedDynamicConfig > where this was not expected) > Fix: > {code} > diff --git > a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > > b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > index ec65237fa6e..378f16e981a 100644 > --- > a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > +++ > b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesCapacitySched.java > @@ -108,13 +108,13 @@ public TestRMWebServicesCapacitySched() { >@Override >public void setUp() throws Exception { > super.setUp(); > -rm = createMockRM(new CapacitySchedulerConfiguration( > -new Configuration(false))); > +rm = 
createMockRM(setupQueueConfiguration(new > CapacitySchedulerConfiguration( > +new Configuration(false)))); > GuiceServletConfig.setInjector( > Guice.createInjector(new WebServletModule(rm))); >} > - public static void setupQueueConfiguration( > + public CapacitySchedulerConfiguration setupQueueConfiguration( >CapacitySchedulerConfiguration config) { > // Define top-level queues > @@ -167,6 +167,8 @@ public static void setupQueueConfiguration( > config.setAutoCreateChildQueueEnabled(a1C, true); > config.setInt(PREFIX + a1C + DOT + > AUTO_CREATED_LEAF_QUEUE_TEMPLATE_PREFIX > + DOT + CAPACITY, 50); > + > +return config; >} >@Test > @@ -407,7 +409,6 @@ public static WebAppDescriptor createWebAppDescriptor() { >} >public static MockRM createMockRM(CapacitySchedulerConfiguration csConf) { > -setupQueueConfiguration(csConf); > YarnConfiguration conf = new YarnConfiguration(csConf); > conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class, > ResourceScheduler.class); > {code} > h3. 2. setupQueueConfiguration creates a mixed queue hierarchy (percentage > and absolute) > {code} > final String c = CapacitySchedulerConfiguration.ROOT + ".c"; > config.setCapacity(c, "[memory=1024]"); > {code} > root.c is configured in absolute mode while root.a and root.b are configured > in percentage > setupQueueConfiguration should be simplified, do the configuration like the > other tests (create a map with the string key value pairs) > h3. 3. createAbsoluteConfigLegacyAutoCreation does not set capacity for the > default queue > That makes it mixed (percentage + absolute) > h3. 4. initAutoQueueHandler is called with wrong units > The * GB is unnecessary, and the vcores should be configured too with a value > that makes sense. > h3. 5. CSConfigGenerator static class makes the tests hard to read. > The test cases should just have their configuration assembled in them. > h3. 6. 
testSchedulerResponseAbsoluteMode does not add any node > No cluster resource -> no effectiveMin/Max resource. > h1. Proposal > These tests need a rework: the configurations should be easy to follow, and the calculated effectiveMin/Max (and any other calculated value) should result in reasonable numbers. The goal is to have an end-to-end-like test suite that verifies the queue hierarchy initialisation. > The queue hierarchies should be simple but at least 2 levels deep, e.g.: > root.default > root.test_1.test_1_1 > root.test_1.test_1_2 > root.test_2 > The helper methods could be moved to a separate utility class from > TestRMWebServicesCapacitySched. > TestRMWebServicesCapacitySched can be kept for the basic tests
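The map-based configuration the proposal suggests could look like the following sketch. The keys follow the standard capacity-scheduler property naming and the 2-level hierarchy proposed above; the specific capacity values are illustrative assumptions, not the committed test code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the proposed "map with string key/value pairs" style of
// test configuration, using the 2-level hierarchy from the proposal.
public class QueueConfigSketch {
    static Map<String, String> sampleHierarchy() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("yarn.scheduler.capacity.root.queues", "default,test_1,test_2");
        conf.put("yarn.scheduler.capacity.root.test_1.queues", "test_1_1,test_1_2");
        conf.put("yarn.scheduler.capacity.root.default.capacity", "25");
        conf.put("yarn.scheduler.capacity.root.test_1.capacity", "50");
        conf.put("yarn.scheduler.capacity.root.test_2.capacity", "25");
        conf.put("yarn.scheduler.capacity.root.test_1.test_1_1.capacity", "60");
        conf.put("yarn.scheduler.capacity.root.test_1.test_1_2.capacity", "40");
        return conf;
    }

    public static void main(String[] args) {
        // The whole queue structure is visible at a glance, which is the point
        // of the proposal.
        sampleHierarchy().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```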
[jira] [Created] (YARN-11503) Adding queues separately in short succession with Mutation API will stop CS allocating new containers
Benjamin Teke created YARN-11503: Summary: Adding queues separately in short succession with Mutation API will stop CS allocating new containers Key: YARN-11503 URL: https://issues.apache.org/jira/browse/YARN-11503 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.4.0 Reporter: Benjamin Teke Adding multiple queues in short succession via Mutation API will result in some race condition when adding the partition metrics for those queues, as noted by the unhandled exception: {code:java} 2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m 2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0 2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN 2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m 2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception. org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists! 
2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m 2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0, usedResources=, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0, effectiveMinResource= , effectiveMaxResource= {code} Initializing the leaf queue root.eca_m should only happen once during a reinit (twice if the validation endpoint is used), but in this case it happened three times within a quarter of a second. This results in an unhandled exception in the async scheduling thread, which will then block new container allocation (existing containers can still transition to other states, however). {code:java} 2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception. org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists! 
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) {code} Even though
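The MetricsException in the trace above is the signature of two threads racing through a check-then-register sequence on the same source name. A hedged sketch of an idempotent alternative follows; this is not the DefaultMetricsSystem API, just an illustration of making concurrent registration a no-op instead of a fatal error:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch: idempotent metrics-source registration. Names and types are
// illustrative; the real registry is Hadoop's DefaultMetricsSystem.
public class MetricsRegistrySketch {
    private final ConcurrentMap<String, Object> sources = new ConcurrentHashMap<>();

    // putIfAbsent is atomic: a second concurrent caller gets the first
    // caller's source back rather than an "already exists" exception.
    Object registerIfAbsent(String name, Object source) {
        Object prev = sources.putIfAbsent(name, source);
        return prev != null ? prev : source;
    }

    public static void main(String[] args) {
        MetricsRegistrySketch registry = new MetricsRegistrySketch();
        String name = "PartitionQueueMetrics,partition=,q0=root,q1=eca_m";
        Object first = registry.registerIfAbsent(name, new Object());
        Object second = registry.registerIfAbsent(name, new Object());
        System.out.println(first == second); // true: re-registration is a no-op
    }
}
```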
[jira] [Updated] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11312: - Fix Version/s: 3.2.5 3.3.6 > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.5, 3.3.6 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11312. -- Resolution: Fixed > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0, 3.2.5, 3.3.6 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11312: - Target Version/s: 3.4.0, 3.2.5, 3.3.6 (was: 3.4.0) > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722190#comment-17722190 ] Benjamin Teke commented on YARN-11312: -- Reopening for 3.2 and 3.3 backports. > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke reopened YARN-11312: -- > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11312. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11312) [UI2] Refresh buttons don't work after EmberJS upgrade
[ https://issues.apache.org/jira/browse/YARN-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722154#comment-17722154 ] Benjamin Teke commented on YARN-11312: -- Background info on the change: https://github.com/emberjs/ember.js/issues/14168 > [UI2] Refresh buttons don't work after EmberJS upgrade > -- > > Key: YARN-11312 > URL: https://issues.apache.org/jira/browse/YARN-11312 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Brian Goerlitz >Assignee: Susheel Gupta >Priority: Minor > Labels: pull-request-available > > After YARN-10826 and YARN-10858, UI2 uses EmberJS 2.8.0, but the refresh > buttons do not work anymore. The following error is thrown in the Chrome > console, but other browsers also fail. > {noformat} > yarn-ui.js:38 Uncaught TypeError: Cannot read properties of undefined > (reading 'send') > at Class.refresh (yarn-ui.js:38:311) > at Class.send (vendor.js:2504:107) > at Class.superWrapper [as send] (vendor.js:1875:112) > at vendor.js:1165:144 > at Object.flaggedInstrument (vendor.js:1583:187) > at runRegisteredAction (vendor.js:1165:68) > at Backburner.run (vendor.js:738:228) > at Object.run [as default] (vendor.js:1840:517) > at Object.handler (vendor.js:1164:178) > at HTMLButtonElement. (vendor.js:2534:128){noformat} > Downgrading the ember version to 2.7.0 seems to resolve the issue, but this > also requires a jquery downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
[ https://issues.apache.org/jira/browse/YARN-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11463: - Fix Version/s: 3.4.0 > Node Labels root directory creation doesn't have a retry logic > -- > > Key: YARN-11463 > URL: https://issues.apache.org/jira/browse/YARN-11463 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > When CS is initialized, it'll [try to create the configured node labels root > dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. > This however doesn't implement any kind of retry logic (in contrast to the > RM FS state store or ZK state store), hence if the distributed file system is > unavailable at the exact moment CS tries to start it'll fail. A retry logic > could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11450) Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase
[ https://issues.apache.org/jira/browse/YARN-11450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-11450: - Fix Version/s: 3.4.0 > Improvements for TestYarnConfigurationFields and TestConfigurationFieldsBase > > > Key: YARN-11450 > URL: https://issues.apache.org/jira/browse/YARN-11450 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11459) Consider changing label called "max resource" on UIv1 and UIv2
[ https://issues.apache.org/jira/browse/YARN-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11459. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Consider changing label called "max resource" on UIv1 and UIv2 > -- > > Key: YARN-11459 > URL: https://issues.apache.org/jira/browse/YARN-11459 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Riya Khandelwal >Assignee: Riya Khandelwal >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Related discussion is in > [-ENGESC-16432-|https://jira.cloudera.com/browse/ENGESC-16432] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
[ https://issues.apache.org/jira/browse/YARN-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11463. -- Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Node Labels root directory creation doesn't have a retry logic > -- > > Key: YARN-11463 > URL: https://issues.apache.org/jira/browse/YARN-11463 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Ashutosh Gupta >Priority: Major > Labels: pull-request-available > > When CS is initialized, it'll [try to create the configured node labels root > dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. > This however doesn't implement any kind of retry logic (in contrast to the > RM FS state store or ZK state store), hence if the distributed file system is > unavailable at the exact moment CS tries to start it'll fail. A retry logic > could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11463) Node Labels root directory creation doesn't have a retry logic
Benjamin Teke created YARN-11463: Summary: Node Labels root directory creation doesn't have a retry logic Key: YARN-11463 URL: https://issues.apache.org/jira/browse/YARN-11463 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Benjamin Teke When CS is initialized, it'll [try to create the configured node labels root dir|https://github.com/apache/hadoop/blob/7169ec450957e5602775c3cd6fe1bf0b95773dfb/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/store/AbstractFSNodeStore.java#L69]. This however doesn't implement any kind of retry logic (in contrast to the RM FS state store or ZK state store), hence if the distributed file system is unavailable at the exact moment CS tries to start it'll fail. A retry logic could be implemented to improve the robustness of the startup process. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
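A retry wrapper of the kind this report asks for could look like the sketch below. The attempt count and backoff are arbitrary placeholders, and a real patch would more likely reuse Hadoop's existing retry-policy machinery rather than hand-roll a loop:

```java
import java.io.IOException;

// Hedged sketch: bounded retries with linear backoff around a filesystem
// operation such as creating the node labels root directory.
public class RetrySketch {
    interface FsOp {
        void run() throws IOException;
    }

    static void withRetries(FsOp op, int maxAttempts, long baseBackoffMs)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();
                return;
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(baseBackoffMs * attempt); // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated mkdirs that fails twice before the file system becomes available
        withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem unavailable");
            }
        }, 5, 10);
        System.out.println(calls[0]); // 3
    }
}
```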