[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606007#comment-16606007 ]
Craig Condit commented on YARN-8751:
------------------------------------

Each of these error codes could have any number of root causes, ranging from transient to task-specific, disk-specific, node-specific, or cluster-level. Trying to do root cause analysis of OS-level failures in code isn't really practical. No two environments are alike, and it's going to be very difficult to set a policy which makes sense for all clusters. This is where things like admin-provided health check scripts come into play. These can check things like disks available, disks non-full, permissions (at top-level dirs) set correctly, etc.

That said, I think we should have defaults which cause the least amount of pain in the majority of cases. It seems to me that in most cases, it's far more likely to be a transient or per-disk issue causing these failures than a global misconfiguration, so not failing the NM makes sense.

As a way to address detection of the specific issue mentioned in this JIRA, top-level permissions on NM-controlled dirs could be validated on startup (if they aren't already) and cause a NM failure at that point (or at least consider the specific disk bad). This would cause fail-fast behavior for something that is clearly configured wrong globally. It would also make these issues occurring at a container level far more likely to be transient or task/app-specific.

> Container-executor permission check errors cause the NM to be marked unhealthy
> ------------------------------------------------------------------------------
>
> Key: YARN-8751
> URL: https://issues.apache.org/jira/browse/YARN-8751
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Shane Kumpf
> Priority: Critical
> Labels: Docker
>
> {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a
> NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by
> {{ContainerLaunch#launchContainer}} (or relaunchContainer).
> The exception occurs based on the exit code returned by container-executor,
> and 7 different exit codes cause the NM to be marked UNHEALTHY.
> {code:java}
> if (exitCode == ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
>     exitCode == ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
>   throw new ConfigurationException(
>       "Linux Container Executor reached unrecoverable exception", e);{code}
> I can understand why these are treated as fatal with the existing process
> container model. However, with privileged Docker containers this may be too
> harsh, as privileged Docker containers don't guarantee the user's identity
> will be propagated into the container, so these mismatches can occur. Outside
> of privileged containers, an application may inadvertently change the
> permissions on one of these directories, triggering this condition.
> In our case, a container changed the "appcache/<appid>/<containerid>"
> directory permissions to 774. Some time later, the process in the container
> died and the Retry Policy kicked in to RELAUNCH the container. When the
> RELAUNCH occurred, container-executor checked the permissions of the
> "appcache/<appid>/<containerid>" directory (the existing workdir is retained
> for RELAUNCH) and returned exit code 35. Exit code 35 is
> COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all
> containers running on that node, when really only this container would have
> been impacted.
> {code:java}
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Container id: container_e15_1535130383425_0085_01_000005
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exit code: 35
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch container failed
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not create container dirsCould not create local files and directories 5 6
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) -
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell output: main : command provided 4
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : run as user is user
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating script paths...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Path /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005 has permission 774 but needs permission 750.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
> 2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch (ContainerRelaunch.java:call(129)) - Failed to launch container due to configuration error.
> org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Relaunch container failed
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
>     ... 8 more
> {code}
> The root of the issue is arguably that we can't guarantee which user is
> running in the container, and we should eliminate writable mounts in this
> scenario. However, marking the NM unhealthy in all these cases does seem
> like overkill.
> Opening this to discuss how we want to address this issue. [~jlowe]
> [~ebadger] [~Jim_Brennan] [~eyang] [~billie.rinaldi] [~ccondit-target] let me
> know your thoughts.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
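P.S. The fail-fast startup validation suggested in the comment could look roughly like the sketch below. This is a hypothetical illustration, not YARN code: the class and method names are invented, and the expected 750 (rwxr-x---) mode is taken from the container-executor log quoted above. A real check would iterate over the configured NM local dirs and feed results into the disk health checker rather than just printing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical fail-fast check for NM-controlled directory permissions. */
public class LocalDirPermissionCheck {

    // container-executor expects mode 750 on these dirs (see log above).
    static final Set<PosixFilePermission> EXPECTED =
        PosixFilePermissions.fromString("rwxr-x---");

    /** Returns true if dir exists and has exactly the expected mode. */
    public static boolean hasExpectedPermissions(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        return Files.getPosixFilePermissions(dir).equals(EXPECTED);
    }

    public static void main(String[] args) throws IOException {
        // Demo: create a scratch dir with the wrong mode (774, as in the
        // JIRA) to show that the check detects the misconfiguration.
        Path dir = Files.createTempDirectory("nm-local-dir");
        Files.setPosixFilePermissions(dir,
            PosixFilePermissions.fromString("rwxrwxr--")); // 774
        if (!hasExpectedPermissions(dir)) {
            // At NM startup this would be the point to fail fast,
            // or to mark just this disk as bad.
            System.out.println("bad permissions on " + dir);
        }
    }
}
```

Running such a check once per local dir at startup gives the fail-fast behavior for global misconfiguration, while leaving per-container failures to be treated as transient.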