[
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15656671#comment-15656671
]
Bibin A Chundatt commented on YARN-5867:
----------------------------------------
Thank you [~jlowe] for looking into the issue.
Sorry, I missed mentioning the bad disk scenario. The following sequence of steps
could also happen in an actual cluster.
# A bad disk (1 of the disks) was shown in the RM UI due to a hardware fault.
# The disk was formatted and mounted again, or a new disk was added.
# After the 2 min interval, the RM UI showed the node as healthy. (The admin will
also think the server is healthy.)
# But containers will start failing randomly.
I will implement a patch based on solution 1 and upload it soon. Additional logging
noting that the {{nmlocal}} folder was created in {{DirectoryCollection#testDirs}}
will be included.
> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---------------------------------------------------------------------------
>
> Key: YARN-5867
> URL: https://issues.apache.org/jira/browse/YARN-5867
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===============
> # Set the umask to 077 for the user
> # Start the nodemanager with the nmlocal dir configured
> The nmlocal dir permission is *755*, as set in
> {{LocalDirsHandlerService#serviceInit}}:
> {code}
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}}
> to run (the delete is used as a simulation)
> # Now check the permission of the {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows:
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, so the
> nmlocal dir is recreated with the wrong permission, *700*.
> A few applications fail at container launch due to permission denied.
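The root cause above can be illustrated outside Hadoop with plain {{java.io.File}}. The sketch below is not the YARN code and not the planned patch; the class {{ProbeDirSketch}} and the methods {{naiveProbe}} and {{guardedProbe}} are hypothetical names used only for illustration. It assumes the probe directory is created with {{File.mkdirs()}}, which creates every missing ancestor with permissions derived from the process umask, and contrasts that with one possible guard that never recreates a missing parent.

```java
import java.io.File;
import java.util.UUID;

// Plain-Java illustration only; none of these names exist in Hadoop.
class ProbeDirSketch {

    // Mirrors the reported behaviour: File.mkdirs() creates every missing
    // ancestor, so a deleted localDir silently comes back with permissions
    // derived from the process umask (700 under umask 077).
    static boolean naiveProbe(File localDir) {
        File probe = new File(localDir, "probe-" + UUID.randomUUID());
        boolean created = probe.mkdirs();
        if (created) {
            probe.delete(); // remove only the probe; the recreated parent stays
        }
        return created;
    }

    // Hypothetical guard: report the directory as bad when it is missing,
    // and use mkdir() so that the parent is never created as a side effect.
    static boolean guardedProbe(File localDir) {
        if (!localDir.isDirectory()) {
            return false;
        }
        File probe = new File(localDir, "probe-" + UUID.randomUUID());
        boolean created = probe.mkdir();
        if (created) {
            probe.delete();
        }
        return created;
    }

    public static void main(String[] args) {
        File base = new File(System.getProperty("java.io.tmpdir"),
                "nmlocal-" + UUID.randomUUID());
        System.out.println("naive probe passed: " + naiveProbe(base)
                + ", parent recreated: " + base.isDirectory());
        base.delete();
        System.out.println("guarded probe passed: " + guardedProbe(base)
                + ", parent recreated: " + base.exists());
    }
}
```

With the guard, a missing local dir is reported as a failed check rather than being recreated, so the explicit-permission creation path (as in {{createNonExistentDirs}}) would remain the only place where the directory is made. Whether that matches solution 1 is not implied here.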