[ https://issues.apache.org/jira/browse/YARN-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shurong Mai updated YARN-9518:
------------------------------
    Description: 
After I configured YARN to use cgroups, the NodeManager started without any 
problem. But when I ran a job, the job failed, and the NodeManager logged the 
exceptions shown at the end of this description.

The important line in these logs is "Can't open file /sys/fs/cgroup/cpu as 
node manager - Is a directory".

After analysing the problem, I found the cause. In CentOS 6, the cgroup "cpu" 
and "cpuacct" subsystems are mounted as follows: 
{code:java}
/sys/fs/cgroup/cpu
/sys/fs/cgroup/cpuacct
{code}
But in CentOS 7, they are as follows:
{code:java}
/sys/fs/cgroup/cpu -> cpu,cpuacct
/sys/fs/cgroup/cpuacct -> cpu,cpuacct
/sys/fs/cgroup/cpu,cpuacct
{code}
"cpu" and "cpuacct" have been merged into "cpu,cpuacct"; "cpu" and "cpuacct" 
are now symbolic links. 

Looking at the source code, the NodeManager gets the cgroup subsystem 
information by reading /proc/mounts, so the path it obtains for both the cpu 
and cpuacct subsystems is "/sys/fs/cgroup/cpu,cpuacct". 

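To illustrate why a lookup through /proc/mounts yields the merged path, here is 
a minimal Python sketch. It is not the NodeManager's actual code: the function 
name find_controller_mount and the sample mount lines are assumptions made for 
illustration, and the exact option list in /proc/mounts varies by system.

```python
# Illustrative /proc/mounts content (an assumption, not copied from a real host).
SAMPLE_MOUNTS = """\
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
"""


def find_controller_mount(mounts_text, controller):
    """Return the mount point of the cgroup hierarchy carrying `controller`, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        # The controllers attached to a hierarchy appear among the mount options.
        if controller in fields[3].split(","):
            return fields[1]
    return None


# Both controllers resolve to the same merged mount point, comma included.
print(find_controller_mount(SAMPLE_MOUNTS, "cpu"))
print(find_controller_mount(SAMPLE_MOUNTS, "cpuacct"))
```

Because the mount point is taken from the mount table rather than from the 
"cpu" symbolic link, the comma ends up in every path derived from it.
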
The resource description argument passed to container-executor looks like this:

 
{code:java}
cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
{code}
The cgroup path contains a comma, but the comma is also the separator between 
multiple resources. Therefore, the cgroup path is truncated to 
"/sys/fs/cgroup/cpu" instead of the correct cgroup path 
"/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks".

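The truncation can be reproduced with a small Python sketch. This is only an 
illustration of comma-based splitting, not the container-executor code (which 
is written in C and is not quoted in this report); split_resource_values is a 
hypothetical helper.

```python
def split_resource_values(argument):
    """Naively split a 'key=v1,v2,...' argument on commas (hypothetical helper)."""
    key, _, value = argument.partition("=")
    return key, value.split(",")


arg = ("cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/"
       "container_1554210318404_0057_02_000001/tasks")
key, values = split_resource_values(arg)

# The single intended path is broken into two pieces at the comma.
print(values[0])  # the truncated path, which is a directory rather than a tasks file
print(values[1])  # the remainder, no longer an absolute path
```

The first piece matches the path in the "Can't open file ... Is a directory" 
log line, which is consistent with the analysis above.
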
 

 

 
{panel:title=exceptional nodemanager logs}
2019-04-19 20:17:20,095 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from LOCALIZED to RUNNING
2019-04-19 20:17:20,101 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1554210318404_0042_01_000001 is : 27
2019-04-19 20:17:20,103 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_1554210318404_0042_01_000001 and exit code: 27
ExitCodeException exitCode=27:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
        at org.apache.hadoop.util.Shell.run(Shell.java:482)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1554210318404_0042_01_000001
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 27
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=27:
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.run(Shell.java:482)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.lang.Thread.run(Thread.java:745)
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell output: main : command provided 1
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is test_hadoop
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested yarn user is datadev
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory
2019-04-19 20:17:20,131 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 27
2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1554210318404_0042_01_000001
{panel}


> can not use CGroups with YARN in centos7 
> -----------------------------------------
>
>                 Key: YARN-9518
>                 URL: https://issues.apache.org/jira/browse/YARN-9518
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 2.9.2, 2.8.5, 2.7.7, 3.1.2
>            Reporter: Shurong Mai
>            Priority: Major
>              Labels: cgroup, patch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
