[
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606270#comment-16606270
]
Craig Condit commented on YARN-8751:
------------------------------------
[[email protected]], looks like we have consensus on the approach. I can take
this one.
> Container-executor permission check errors cause the NM to be marked unhealthy
> ------------------------------------------------------------------------------
>
> Key: YARN-8751
> URL: https://issues.apache.org/jira/browse/YARN-8751
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Shane Kumpf
> Priority: Critical
> Labels: Docker
>
> {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a
> NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by
> {{ContainerLaunch#launchContainer}} (or {{relaunchContainer}}). The exception
> is thrown based on the exit code returned by container-executor, and 7
> different exit codes cause the NM to be marked UNHEALTHY.
> {code:java}
> if (exitCode == ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
>     exitCode == ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
>   throw new ConfigurationException(
>       "Linux Container Executor reached unrecoverable exception", e);
> }
> {code}
> I can understand why these are treated as fatal with the existing process
> container model. However, with privileged Docker containers this may be too
> harsh, because privileged Docker containers don't guarantee that the user's
> identity will be propagated into the container, so these mismatches can
> occur. Outside of privileged containers, an application may inadvertently
> change the permissions on one of these directories, triggering this condition.
> In our case, a container changed the "appcache/<appid>/<containerid>"
> directory permissions to 774. Some time later, the process in the container
> died and the Retry Policy kicked in to RELAUNCH the container. When the
> RELAUNCH occurred, container-executor checked the permissions of the
> "appcache/<appid>/<containerid>" directory (the existing workdir is retained
> for RELAUNCH) and returned exit code 35. Exit code 35 is
> COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all
> containers running on that node, when really only this container would have
> been impacted.
> {code:java}
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Container id:
> container_e15_1535130383425_0085_01_000005
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Exit code: 35
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch
> container failed
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not
> create container dirsCould not create local files and directories 5 6
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) -
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Shell output: main : command
> provided 4
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - main : run as user is user
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Creating script paths...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Path
> /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005
> has permission 774 but needs permission 750.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor
> (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
> 2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch
> (ContainerRelaunch.java:call(129)) - Failed to launch container due to
> configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container
> Executor reached unrecoverable exception
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
> Relaunch container failed
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> ... 8 more
> {code}
> The root of the issue is arguably that we can't guarantee which user is
> running in the container, so we should eliminate writable mounts in this
> scenario. However, marking the NM unhealthy in all of these cases does seem
> like overkill.
> Opening this to discuss how we want to address this issue. [~jlowe]
> [~ebadger] [~Jim_Brennan] [~eyang] [~billie.rinaldi] [~ccondit-target] let me
> know your thoughts.
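
To illustrate the concern in the quoted description, here is a minimal Java
sketch of separating node-wide failures from failures that only affect one
container. It is not the approach agreed on in this issue and not actual
NodeManager code: the class ExitCodeScopeSketch, the FailureScope enum, and
the choice of which codes count as node-wide are all assumptions made for
illustration. The ExitCode names are the seven listed in the description;
their numeric values are left out because only 35 appears in the report.

```java
import java.util.EnumSet;
import java.util.Set;

public class ExitCodeScopeSketch {

  // Hypothetical stand-in for the seven exit codes named in the issue.
  // Numeric values are intentionally omitted.
  enum ExitCode {
    INVALID_CONTAINER_EXEC_PERMISSIONS,
    INVALID_CONFIG_FILE,
    COULD_NOT_CREATE_SCRIPT_COPY,
    COULD_NOT_CREATE_CREDENTIALS_FILE,
    COULD_NOT_CREATE_WORK_DIRECTORIES,
    COULD_NOT_CREATE_APP_LOG_DIRECTORIES,
    COULD_NOT_CREATE_TMP_DIRECTORIES
  }

  // Hypothetical: who is affected by a given failure.
  enum FailureScope { NODE, CONTAINER }

  // Assumption: only problems with the container-executor binary or its
  // config file affect every container on the node.
  static final Set<ExitCode> NODE_WIDE = EnumSet.of(
      ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS,
      ExitCode.INVALID_CONFIG_FILE);

  // Everything outside NODE_WIDE fails only the container being launched.
  static FailureScope classify(ExitCode code) {
    return NODE_WIDE.contains(code) ? FailureScope.NODE : FailureScope.CONTAINER;
  }

  public static void main(String[] args) {
    for (ExitCode code : ExitCode.values()) {
      System.out.println(code + " -> " + classify(code));
    }
  }
}
```

Under this assumed split, the COULD_NOT_CREATE_WORK_DIRECTORIES failure from
the report would fail only the relaunched container instead of marking the
whole NM unhealthy.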
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)