Jun Gong created YARN-3831:
------------------------------

Summary: Localization failed when a local disk turns from bad to good without NM initializes it
Key: YARN-3831
URL: https://issues.apache.org/jira/browse/YARN-3831
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Jun Gong
Assignee: Jun Gong

A local disk turns from bad to good without the NM initializing it (that is, without creating /path-to-local-dir/usercache and /path-to-local-dir/filecache). When a container is localized, container-executor tries to create directories under /path-to-local-dir/usercache and fails, so the container's localization fails.

The related log is as follows:
{noformat}
2015-06-19 18:00:01,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1431957472783_38706012_01_000465
2015-06-19 18:00:01,212 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /data8/yarnenv/local/nmPrivate/container_1431957472783_38706012_01_000465.tokens. Credentials list:
2015-06-19 18:00:01,216 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1431957472783_38706012_01_000465 startLocalizer is : 20
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 0
2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is tdwadmin
2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory
2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
java.io.IOException: Application application_1431957472783_38706012 initialization failed (exitCode=20) with output: main : command provided 0
main : user is tdwadmin
Failed to create directory /data2/yarnenv/local/usercache/tdwadmin - No such file or directory

        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:214)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:981)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
        at org.apache.hadoop.util.Shell.run(Shell.java:379)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:205)
        ... 1 more
2015-06-19 18:00:01,216 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1431957472783_38706012_01_000465 transitioned from LOCALIZING to LOCALIZATION_FAILED
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
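
The failure described in this report comes down to an initialization step that is skipped when a disk is marked good again. As a rough illustration only, the Java sketch below shows the kind of check that would recreate the two subdirectories named in the description before localization is attempted. The class LocalDirInitSketch and the method ensureLocalDirLayout are hypothetical names invented for this example. They are not NodeManager APIs, and the sketch leaves out the ownership and permission handling that a real fix would need.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, not NodeManager code: recreate the per-disk
 * subdirectories that the report says are missing after a disk
 * turns from bad to good.
 */
public class LocalDirInitSketch {

    // Subdirectory names taken from the issue description.
    static final String[] REQUIRED_SUBDIRS = {"usercache", "filecache"};

    /**
     * Creates any missing required subdirectory under localDir and
     * returns the list of directories that had to be created.
     */
    public static List<Path> ensureLocalDirLayout(Path localDir) throws IOException {
        List<Path> created = new ArrayList<>();
        for (String name : REQUIRED_SUBDIRS) {
            Path sub = localDir.resolve(name);
            if (!Files.isDirectory(sub)) {
                // Assumption: default permissions are acceptable for this sketch.
                Files.createDirectories(sub);
                created.add(sub);
            }
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for /path-to-local-dir.
        Path disk = Files.createTempDirectory("local-dir");
        System.out.println("created on first call: " + ensureLocalDirLayout(disk));
        System.out.println("created on second call: " + ensureLocalDirLayout(disk));
    }
}
```

Calling such a check at the point where a disk is moved back to the good list would make the operation idempotent: a disk that already has the layout is left untouched, and a freshly recovered one gets the directories that container-executor expects to find.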