Shane Kumpf created YARN-8751:
---------------------------------

             Summary: Container-executor permission check errors cause the NM 
to be marked unhealthy
                 Key: YARN-8751
                 URL: https://issues.apache.org/jira/browse/YARN-8751
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Shane Kumpf


{{ContainerLaunch}} and {{ContainerRelaunch}} contain logic to mark a
NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by
{{ContainerLaunch#launchContainer}} (or {{relaunchContainer}}). The exception
is thrown based on the exit code returned by container-executor, and seven
different exit codes cause the NM to be marked UNHEALTHY:
{code:java}
if (exitCode ==
    ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
    exitCode ==
        ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
    exitCode ==
        ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
    exitCode ==
        ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
    exitCode ==
        ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
    exitCode ==
        ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
    exitCode ==
        ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
  throw new ConfigurationException(
      "Linux Container Executor reached unrecoverable exception", e);
}
{code}
I can understand why these are treated as fatal with the existing process
container model. However, with privileged Docker containers this may be too
harsh, because privileged Docker containers don't guarantee that the user's
identity will be propagated into the container.
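
For discussion purposes, the exit-code check quoted above can be restated as a
set-membership test. This is only a minimal sketch, not the actual
{{LinuxContainerExecutor}} code: the class name, the {{OTHER}} placeholder, and
the {{marksNodeUnhealthy}} helper are hypothetical, and the numeric exit code
values are left out on purpose because only 35 is confirmed in this report.

```java
import java.util.EnumSet;
import java.util.Set;

public class FatalExitCodeSketch {

  // Names copied from the check quoted above. OTHER is a hypothetical
  // placeholder standing in for every exit code not listed in that check.
  enum ExitCode {
    INVALID_CONTAINER_EXEC_PERMISSIONS,
    INVALID_CONFIG_FILE,
    COULD_NOT_CREATE_SCRIPT_COPY,
    COULD_NOT_CREATE_CREDENTIALS_FILE,
    COULD_NOT_CREATE_WORK_DIRECTORIES, // reported as exit code 35
    COULD_NOT_CREATE_APP_LOG_DIRECTORIES,
    COULD_NOT_CREATE_TMP_DIRECTORIES,
    OTHER
  }

  // The seven codes that currently lead to a ConfigurationException,
  // i.e. everything in the enum except the OTHER placeholder.
  static final Set<ExitCode> NODE_FATAL =
      EnumSet.complementOf(EnumSet.of(ExitCode.OTHER));

  // Hypothetical helper: true if this exit code would mark the NM UNHEALTHY.
  static boolean marksNodeUnhealthy(ExitCode code) {
    return NODE_FATAL.contains(code);
  }

  public static void main(String[] args) {
    System.out.println("fatal codes: " + NODE_FATAL.size());
    System.out.println("work dirs fatal: "
        + marksNodeUnhealthy(ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES));
    System.out.println("other fatal: " + marksNodeUnhealthy(ExitCode.OTHER));
  }
}
```

Written this way, it is easier to see that the decision depends only on the
exit code and not on the kind of container (process or privileged Docker) that
produced it.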

In our case, a privileged container changed the
"appcache/<appid>/<containerid>" directory permissions to 774. Some time later,
the process in the container died and the Retry Policy kicked in to RELAUNCH
the container. When the RELAUNCH occurred, container-executor checked the
permissions of the "appcache/<appid>/<containerid>" directory (the existing
workdir is retained for RELAUNCH), found 774 where it requires 750, and
returned exit code 35. Exit code 35 is COULD_NOT_CREATE_WORK_DIRECTORIES, which
is a fatal error. As a result, all containers running on that node were killed,
when really only this one container should have been impacted.
{code:java}
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Container id: 
container_e15_1535130383425_0085_01_000005
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Exit code: 35
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch container 
failed
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Shell error output: Could not create 
container dirsCould not create local files and directories 5 6
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) -
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Shell output: main : command provided 
4
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - main : run as user is user
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Creating script paths...
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Creating local dirs...
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Path 
/grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005
 has permission 774 but needs permission 750.
2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
(ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch 
(ContainerRelaunch.java:call(129)) - Failed to launch container due to 
configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container 
Executor reached unrecoverable exception
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
 Relaunch container failed
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
        ... 8 more
{code}
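
The permission check that fails in the log above can be illustrated with a
small Java sketch. This is not the container-executor implementation (which is
native code); it assumes an exact-match comparison against mode 750, as the
"has permission 774 but needs permission 750" message suggests, and the class
and method names are hypothetical. It also assumes a POSIX file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class WorkDirPermissionSketch {

  // Mode 750 (rwxr-x---), the mode the log message says the directory needs.
  static final Set<PosixFilePermission> REQUIRED =
      PosixFilePermissions.fromString("rwxr-x---");

  // Assumption: the check passes only when the mode equals REQUIRED exactly.
  static boolean hasRequiredMode(Path dir) throws IOException {
    return Files.getPosixFilePermissions(dir).equals(REQUIRED);
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the retained "appcache/<appid>/<containerid>" workdir.
    Path dir = Files.createTempDirectory("container_sketch");
    try {
      // Simulate the privileged container changing the mode to 774.
      Files.setPosixFilePermissions(
          dir, PosixFilePermissions.fromString("rwxrwxr--"));
      System.out.println("774 passes: " + hasRequiredMode(dir));

      // Restore the expected mode and check again.
      Files.setPosixFilePermissions(dir, REQUIRED);
      System.out.println("750 passes: " + hasRequiredMode(dir));
    } finally {
      Files.delete(dir);
    }
  }
}
```

Under that assumption, any mode change made from inside the container, even
one that only loosens group or other bits, is enough to fail the check on
RELAUNCH.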
Arguably, the root of the issue is that we can't guarantee which user is
running in the container, and we should eliminate writable mounts in this
scenario. However, marking the NM unhealthy in all of these cases does seem
like overkill.

Opening this to discuss how we want to address this issue. [~jlowe] [~ebadger] 
[~Jim_Brennan] [~eyang] [~billie.rinaldi] [~ccondit-target] let me know your 
thoughts.