[jira] [Updated] (YARN-11743) Cgroup v2 support should fall back to v1 when there are no v2 controllers

2024-12-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11743:
--
Labels: pull-request-available  (was: )

> Cgroup v2 support should fall back to v1 when there are no v2 controllers
> -
>
> Key: YARN-11743
> URL: https://issues.apache.org/jira/browse/YARN-11743
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Benjamin Teke
> Assignee: Peter Szucs
> Priority: Major
> Labels: pull-request-available
>
> Cgroup v1/v2 mixed mode support was introduced in YARN-11692; however, it
> does not support the edge case where the NM has cgroup v2 support enabled
> ({{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} set to
> true) but only cgroup v1 controllers are mounted. In larger clusters there
> is a chance that some nodes are already on newer OSes with cgroup v2 as the
> default while others are still using v1.
> Currently, launching an NM with cgroup v2 support enabled fails if no
> cgroup.controllers file is present:
> {code:java}
> Failed to initialize controller paths! Exception:
> java.io.IOException: No cgroup controllers file found in the directory specified: /var/lib/yarn-ce/cgroups
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.readControllersFile(CGroupsV2HandlerImpl.java:130)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.parsePreConfiguredMountPath(CGroupsV2HandlerImpl.java:101)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.initializeControllerPaths(AbstractCGroupsHandler.java:133)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.init(AbstractCGroupsHandler.java:107)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.<init>(AbstractCGroupsHandler.java:103)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:71)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:83)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupV2Handler(ResourceHandlerModule.java:106)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupHandlers(ResourceHandlerModule.java:83)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initCGroupsCpuResourceHandler(ResourceHandlerModule.java:177)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeConfiguredResourceHandlerChain(ResourceHandlerModule.java:334)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.getConfiguredResourceHandlerChain(ResourceHandlerModule.java:383)
>   at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:314)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:427)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
>   at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
> {code}
> In short, the error thrown by
> [readControllersFile|https://github.com/apache/hadoop/blob/950b2ff773fa828eb13bed7c3fe6b3d52c7fff18/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsV2HandlerImpl.java#L127]
> should be handled when the required controllers are mounted in v1:
> {code:java}
>   /**
>    * Parse the cgroup v2 controllers file (cgroup.controllers) to check the enabled controllers.
>    * @param cgroupPath path to the cgroup directory
>    * @return set of enabled and YARN supported controllers.
>    * @throws IOException if the file is not found or cannot be read
>    */
>   public Set<String> readControllersFile(String cgroupPath) throws IOException {
>     File cgroupControllersFile = new File(cgroupPath + Path.SEPARATOR + CGROUP_CONTROLLERS_FILE);
>     if (!cgroupControllersFile.exists()) {
>   throw new IOException("No cgroup controllers file found in the 
> 

[jira] [Updated] (YARN-11743) Cgroup v2 support should fall back to v1 when there are no v2 controllers

2024-12-10 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-11743:
-
Description: 
Cgroup v1/v2 mixed mode support was introduced in YARN-11692; however, it does
not support the edge case where the NM has cgroup v2 support enabled
({{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} set to true)
but only cgroup v1 controllers are mounted. In larger clusters there is a
chance that some nodes are already on newer OSes with cgroup v2 as the default
while others are still using v1.

Currently, launching an NM with cgroup v2 support enabled fails if no
cgroup.controllers file is present:

{code:java}
Failed to initialize controller paths! Exception:
java.io.IOException: No cgroup controllers file found in the directory specified: /var/lib/yarn-ce/cgroups
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.readControllersFile(CGroupsV2HandlerImpl.java:130)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.parsePreConfiguredMountPath(CGroupsV2HandlerImpl.java:101)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.initializeControllerPaths(AbstractCGroupsHandler.java:133)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.init(AbstractCGroupsHandler.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.<init>(AbstractCGroupsHandler.java:103)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:71)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupV2Handler(ResourceHandlerModule.java:106)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupHandlers(ResourceHandlerModule.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initCGroupsCpuResourceHandler(ResourceHandlerModule.java:177)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeConfiguredResourceHandlerChain(ResourceHandlerModule.java:334)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.getConfiguredResourceHandlerChain(ResourceHandlerModule.java:383)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:314)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:427)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
{code}

In short, the error thrown by
[readControllersFile|https://github.com/apache/hadoop/blob/950b2ff773fa828eb13bed7c3fe6b3d52c7fff18/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsV2HandlerImpl.java#L127]
should be handled when the required controllers are mounted in v1.
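
One way to check whether the required controllers are mounted in v1 is to look at the mount table. The sketch below is only an illustration and not the YARN-11743 patch: the class and method names are hypothetical. It relies on v1 hierarchies appearing in /proc/mounts with filesystem type cgroup (v2 uses cgroup2) and listing their controllers among the mount options.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; names are illustrative and not part of Hadoop. */
final class V1MountCheckSketch {

  /** Collects the known controller names found on cgroup v1 mount entries. */
  static Set<String> v1Controllers(List<String> mountLines, Set<String> known) {
    Set<String> found = new HashSet<>();
    for (String line : mountLines) {
      // Fields: device, mount point, fs type, options, dump, pass.
      String[] fields = line.trim().split("\\s+");
      if (fields.length < 4 || !"cgroup".equals(fields[2])) {
        continue; // "cgroup2" entries are v2 mounts and are skipped here
      }
      for (String option : fields[3].split(",")) {
        if (known.contains(option)) {
          found.add(option);
        }
      }
    }
    return found;
  }

  /** True when every required controller is mounted in a v1 hierarchy. */
  static boolean canFallBackToV1(List<String> mountLines, Set<String> required) {
    return v1Controllers(mountLines, required).containsAll(required);
  }
}
```

A caller would pass the lines of /proc/mounts and the controllers it needs (for example cpu and memory) and only attempt the v1 fallback when canFallBackToV1 returns true.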

  was:
Cgroup v1/v2 mixed mode support was introduced in YARN-11692; however, it does
not support the edge case where the NM has cgroup v2 support enabled
({{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} set to true)
but only cgroup v1 controllers are mounted. In larger clusters there is a
chance that some nodes are already on newer OSes with cgroup v2 as the default
while others are still using v1.

Currently, launching an NM with cgroup v2 support enabled fails if no
cgroup.controllers file is present.


> Cgroup v2 support should fall back to v1 when there are no v2 controllers
> -
>
> Key: YARN-11743
> URL: https://issues.apache.org/jira/browse/YARN-11743
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Benjamin Teke
> Priority: Major
>
> Cgroup v1/v2 mixed mode support was introduced in YARN-11692; however, it does
> not support the edge case where the NM has cgroup v2 support enabled (using
> {{yarn.nodemanager.linux-container-executor.cgroups.v2.en

[jira] [Updated] (YARN-11743) Cgroup v2 support should fall back to v1 when there are no v2 controllers

2024-12-10 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-11743:
-
Description: 
Cgroup v1/v2 mixed mode support was introduced in YARN-11692; however, it does
not support the edge case where the NM has cgroup v2 support enabled
({{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} set to true)
but only cgroup v1 controllers are mounted. In larger clusters there is a
chance that some nodes are already on newer OSes with cgroup v2 as the default
while others are still using v1.

Currently, launching an NM with cgroup v2 support enabled fails if no
cgroup.controllers file is present:

{code:java}
Failed to initialize controller paths! Exception:
java.io.IOException: No cgroup controllers file found in the directory specified: /var/lib/yarn-ce/cgroups
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.readControllersFile(CGroupsV2HandlerImpl.java:130)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.parsePreConfiguredMountPath(CGroupsV2HandlerImpl.java:101)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.initializeControllerPaths(AbstractCGroupsHandler.java:133)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.init(AbstractCGroupsHandler.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.AbstractCGroupsHandler.<init>(AbstractCGroupsHandler.java:103)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:71)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsV2HandlerImpl.<init>(CGroupsV2HandlerImpl.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupV2Handler(ResourceHandlerModule.java:106)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeCGroupHandlers(ResourceHandlerModule.java:83)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initCGroupsCpuResourceHandler(ResourceHandlerModule.java:177)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.initializeConfiguredResourceHandlerChain(ResourceHandlerModule.java:334)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule.getConfiguredResourceHandlerChain(ResourceHandlerModule.java:383)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:314)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:427)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
{code}

In short, the error thrown by
[readControllersFile|https://github.com/apache/hadoop/blob/950b2ff773fa828eb13bed7c3fe6b3d52c7fff18/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsV2HandlerImpl.java#L127]
should be handled when the required controllers are mounted in v1:


{code:java}
  /**
   * Parse the cgroup v2 controllers file (cgroup.controllers) to check the enabled controllers.
   * @param cgroupPath path to the cgroup directory
   * @return set of enabled and YARN supported controllers.
   * @throws IOException if the file is not found or cannot be read
   */
  public Set<String> readControllersFile(String cgroupPath) throws IOException {
    File cgroupControllersFile = new File(cgroupPath + Path.SEPARATOR + CGROUP_CONTROLLERS_FILE);
    if (!cgroupControllersFile.exists()) {
      throw new IOException("No cgroup controllers file found in the directory specified: " +
          cgroupPath);
    }

    String enabledControllers = FileUtils.readFileToString(cgroupControllersFile,
        StandardCharsets.UTF_8);
    Set<String> validCGroups = getValidCGroups();
    Set<String> controllerSet =
        new HashSet<>(Arrays.asList(enabledControllers.split(" ")));
    // Collect the valid subsystem names
    controllerSet.retainAll(validCGroups);
    if (controllerSet.isEmpty()) {
      LOG.warn("The following cgroup directory doesn't contain any supported controllers: " +
          cgroupPath);
    }

    return controllerSet;
  }
{code}
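
As a rough illustration of handling that error, the standalone sketch below catches the IOException and selects v1 instead of failing. It is not the actual fix: the class name, the chooseVersion method, and the SUPPORTED set are assumptions made for the example. It also uses java.nio instead of the Hadoop and commons-io helpers so that it runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not the YARN-11743 patch. */
final class CgroupFallbackSketch {
  // Assumed subset of controllers of interest; illustrative only.
  static final Set<String> SUPPORTED = Set.of("cpu", "memory", "cpuset", "io");

  /** Same shape as readControllersFile: throws when the v2 file is absent. */
  static Set<String> readV2Controllers(Path cgroupDir) throws IOException {
    Path file = cgroupDir.resolve("cgroup.controllers");
    if (!Files.exists(file)) {
      throw new IOException("No cgroup controllers file found in: " + cgroupDir);
    }
    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    // Split on any whitespace so the trailing newline does not end up in a name.
    Set<String> controllers = new HashSet<>(Arrays.asList(content.trim().split("\\s+")));
    controllers.retainAll(SUPPORTED);
    return controllers;
  }

  /** Returns "v2" when supported v2 controllers are readable, otherwise "v1". */
  static String chooseVersion(Path cgroupDir, boolean v2Enabled) {
    if (!v2Enabled) {
      return "v1";
    }
    try {
      return readV2Controllers(cgroupDir).isEmpty() ? "v1" : "v2";
    } catch (IOException e) {
      // A missing or unreadable cgroup.controllers file triggers the fallback.
      return "v1";
    }
  }
}
```

The sketch does not verify that the v1 controllers are actually mounted before falling back; a real implementation would need that check, as the description requires.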


  was:
Cgroup v1/v2 mixe