[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606003#comment-16606003
 ] 

Eric Yang commented on YARN-8751:
---------------------------------

[[email protected]] I believe the COULD_NOT_CREATE_WORK_DIRECTORIES exit
code should be reported only after working directory creation has failed on
all disks and the options are exhausted.  Relaunch may single out one working
directory and report a false positive, even though the system could still fall
back to creating a new working directory on another disk and move forward.  I
am not sure whether the test system has more than one local disk.  If it had
only one disk, it may appear that this single container crashed the node
manager.  If relaunch doesn't retry other disks, then that is a bug, and the
fix is to change the container-executor logic to detect this case and create
the working directory on another disk.  This is similar to the fault tolerance
design in HDFS: relaunch makes a best effort to reuse the same working
directory, but uses another data directory if the current one has turned bad.
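
To make the fallback idea concrete, here is a minimal sketch in Java.  It is
not the container-executor implementation (which is native code) and not an
existing NodeManager API; the class WorkDirFallback and the methods isUsable
and chooseWorkDir are hypothetical names used only for illustration.  It
assumes the candidate local directories are known up front and treats a
directory as "bad" when it cannot be created or accessed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; not part of the NodeManager or container-executor. */
public class WorkDirFallback {

  /** Returns true if dir exists (or can be created) and is accessible. */
  static boolean isUsable(Path dir) {
    try {
      // Throws if the path exists but is not a directory, or cannot be created.
      Files.createDirectories(dir);
      return Files.isReadable(dir)
          && Files.isWritable(dir)
          && Files.isExecutable(dir);
    } catch (IOException e) {
      return false;
    }
  }

  /**
   * Best effort: reuse the previous working directory; if it has turned bad,
   * try the same relative path under every other local dir. An empty result
   * means every disk was tried and failed, which is the only case this
   * sketch would treat as a node-level failure.
   */
  static Optional<Path> chooseWorkDir(Path previous, List<Path> localDirs,
      String relativePath) {
    if (previous != null && isUsable(previous)) {
      return Optional.of(previous);
    }
    for (Path root : localDirs) {
      Path candidate = root.resolve(relativePath);
      if (candidate.equals(previous)) {
        continue; // already known to be bad
      }
      if (isUsable(candidate)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

With this shape, a single damaged working directory only moves the container
to another disk, and the fatal exit code would be raised only when
chooseWorkDir returns an empty result.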

Let's look at the problem from a different angle: the container is performing
destructive operations on its working directory and knocking out all disks by
abusing relaunch.  This looks more like a deliberate attempt to sabotage the
system.  In this case, it is really the system administrator's responsibility
not to grant privileged containers to such a badly behaved user or image.
This is the same as saying: don't hand them a chainsaw if you know they are
irresponsible individuals.  There is little that can be done to protect
irresponsible individuals from themselves.  You can only protect them by not
giving them too much power.  Disabling writable mounts for privileged
containers is the wrong option, because there are real programs that run
multi-user containers and depend on the privileged container feature.  If the
badly behaved program is a QA test, then we may have to hand-wave it: we
handed you a chainsaw, so read the instructions and be careful with it.

> Container-executor permission check errors cause the NM to be marked unhealthy
> ------------------------------------------------------------------------------
>
>                 Key: YARN-8751
>                 URL: https://issues.apache.org/jira/browse/YARN-8751
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Shane Kumpf
>            Priority: Critical
>              Labels: Docker
>
> {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a 
> NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by 
> {{ContainerLaunch#launchContainer}} (or relaunchContainer). The exception 
> occurs based on the exit code returned by container-executor, and 7 different 
> exit codes cause the NM to be marked UNHEALTHY.
> {code:java}
> if (exitCode ==
>     ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
>     exitCode ==
>         ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
>   throw new ConfigurationException(
>       "Linux Container Executor reached unrecoverable exception", e);{code}
> I can understand why these are treated as fatal with the existing process 
> container model. However, with privileged Docker containers this may be too 
> harsh, as privileged Docker containers don't guarantee the user's identity 
> will be propagated into the container, so these mismatches can occur. Outside 
> of privileged containers, an application may inadvertently change the 
> permissions on one of these directories, triggering this condition.
> In our case, a container changed the "appcache/<appid>/<containerid>" 
> directory permissions to 774. Some time later, the process in the container 
> died and the Retry Policy kicked in to RELAUNCH the container. When the 
> RELAUNCH occurred, container-executor checked the permissions of the 
> "appcache/<appid>/<containerid>" directory (the existing workdir is retained 
> for RELAUNCH) and returned exit code 35. Exit code 35 is 
> COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all 
> containers running on that node, when really only this container would have 
> been impacted.
> {code:java}
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Container id: 
> container_e15_1535130383425_0085_01_000005
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exit code: 35
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch 
> container failed
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not 
> create container dirsCould not create local files and directories 5 6
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) -
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Shell output: main : command 
> provided 4
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - main : run as user is user
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Creating script paths...
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Path 
> /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005
>  has permission 774 but needs permission 750.
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
> 2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch 
> (ContainerRelaunch.java:call(129)) - Failed to launch container due to 
> configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container 
> Executor reached unrecoverable exception
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
>         ... 8 more
> {code}
> The root of the issue could be considered the fact that we can't guarantee 
> which user is running in the container, and that we should eliminate 
> writable mounts in this scenario. However, marking the NM unhealthy in all 
> these cases does seem like overkill.
> Opening this to discuss how we want to address this issue. [~jlowe] 
> [~ebadger] [~Jim_Brennan] [~eyang] [~billie.rinaldi] [~ccondit-target] let me 
> know your thoughts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
