[
https://issues.apache.org/jira/browse/YARN-5749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Yang updated YARN-5749:
---------------------------
Summary: Fail to localize resources after health status for local dirs
changed (was: Fail to localize resources after health status for local dirs
changed occurred by the change of FileContext#setUMask)
> Fail to localize resources after health status for local dirs changed
> ---------------------------------------------------------------------
>
> Key: YARN-5749
> URL: https://issues.apache.org/jira/browse/YARN-5749
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0-alpha2
> Reporter: Tao Yang
>
> HADOOP-13440 changed the FileContext#setUMask method so that the umask moves
> from a local variable to a global one, by updating the conf value of
> "fs.permissions.umask-mode".
> Both LogWriter and ResourceLocalizationService may call this method, and each
> call now updates the global umask.
> After an application finishes, LogWriter sets the umask to "137" while
> uploading logs for containers. The global umask changes immediately and
> affects other services. In my case, after one of the local directories was
> marked as bad (because its used disk space was above the threshold defined by
> "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage"),
> ResourceLocalizationService reinitialized the remaining local directories and
> changed their permissions from "drwxr-xr-x" to "drw-r-----" (the umask had
> changed from "022" to "137"). From then on, the NM always fails to localize
> resources because the local directories are not executable.
> Detailed logs are as follows:
> {code}
> 2016-10-19 15:36:32,650 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Disk Error Exception:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not executable: /home/yangtao.yt/hadoop-data/nm-local-dir-2/nmPrivate
>     at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:215)
>     at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:190)
>     at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:124)
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createPath(LocalDirAllocator.java:350)
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:412)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
>     at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:563)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1162)
> 2016-10-19 15:36:32,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for nmPrivate/container_e26_1476858409240_0004_01_000005.tokens
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:441)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
>     at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:563)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1162)
> 2016-10-19 15:36:32,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e26_1476858409240_0004_01_000005 transitioned from LOCALIZING to LOCALIZATION_FAILED
> {code}
> In my opinion, it would be better if FileContext remained compatible with
> its past usage.
> Please feel free to give your suggestions.
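
The permission change described in the report follows from plain umask arithmetic. The sketch below is a toy model only: it does not use any Hadoop classes, and the names UmaskSketch, conf, applyUmask, toSymbolic and modeOfNewDir are hypothetical. It assumes that a new directory is requested with mode 0777 and that the effective umask is read from one shared map under the key "fs.permissions.umask-mode", with "022" as the starting value mentioned in the report.

```java
import java.util.HashMap;
import java.util.Map;

public class UmaskSketch {
    // Hypothetical stand-in for a process-wide configuration object.
    // This is NOT Hadoop's Configuration class.
    static final Map<String, String> conf = new HashMap<>();
    static final String UMASK_KEY = "fs.permissions.umask-mode";

    /** Permission bits that remain after masking a requested mode with a umask. */
    static int applyUmask(int requestedMode, int umask) {
        return requestedMode & ~umask & 0777;
    }

    /** Render nine permission bits as an ls-style string for a directory. */
    static String toSymbolic(int mode) {
        String flags = "rwxrwxrwx";
        StringBuilder sb = new StringBuilder("d");
        for (int i = 0; i < 9; i++) {
            boolean set = (mode & (1 << (8 - i))) != 0;
            sb.append(set ? flags.charAt(i) : '-');
        }
        return sb.toString();
    }

    /** Mode a new directory would get under the umask currently stored in conf. */
    static String modeOfNewDir() {
        // Assumption: "022" is the umask in effect before any caller changes it.
        int umask = Integer.parseInt(conf.getOrDefault(UMASK_KEY, "022"), 8);
        return toSymbolic(applyUmask(0777, umask));
    }

    public static void main(String[] args) {
        // Before any caller touches the shared value.
        System.out.println(modeOfNewDir());
        // One caller (the log upload path in the report) stores "137" globally.
        conf.put(UMASK_KEY, "137");
        // A different caller that creates directories afterwards sees the new umask.
        System.out.println(modeOfNewDir());
    }
}
```

With umask 022 the model yields "drwxr-xr-x", and with umask 137 it yields "drw-r-----", matching the two permission strings quoted in the report. The second mode has no execute bit for any user, which is consistent with the "Directory is not executable" error in the log.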