[
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727698#comment-14727698
]
Varun Vasudev commented on YARN-3591:
-------------------------------------
Thanks for the latest patch, Lavkesh! A couple of comments:
1. Instead of
{code}
+ this.dirsHandler = dirHandler;
{code}
in the new constructors you added, can you add that line to
{code}
LocalResourcesTrackerImpl(String user, ApplicationId appId,
Dispatcher dispatcher,
ConcurrentMap<LocalResourceRequest,LocalizedResource> localrsrc,
boolean useLocalCacheDirectoryManager, Configuration conf,
NMStateStoreService stateStore)
{code}
and have the other constructors call this one? The existing constructors can
then pass null for the directory handler.
2.
{code}
+ ret |= isParent(rsrc.getLocalPath().toUri().getPath(), dir);
{code}
We don't need to iterate through all the local dirs. Once ret is true we can
break the loop and return.
Rest of the patch looks good.
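
To make the two suggestions concrete, here is a minimal sketch. It is not the actual LocalResourcesTrackerImpl code: TrackerSketch, DirsHandler, isUnderLocalDir and the one-line isParent are hypothetical stand-ins, and the real constructors take many more arguments. It assumes the directory handler becomes an extra parameter of the one constructor that assigns fields, and that a null handler means "no local dirs to check".

```java
import java.util.Arrays;
import java.util.List;

class TrackerSketch {
  /** Hypothetical stand-in for the NM directory handler type. */
  interface DirsHandler {
    List<String> getLocalDirs();
  }

  private final String user;
  private final DirsHandler dirsHandler; // null when an existing constructor is used

  // Existing-style constructor: delegates and passes null for the handler.
  TrackerSketch(String user) {
    this(user, null);
  }

  // The only constructor that assigns fields, so dirsHandler is set in one place.
  TrackerSketch(String user, DirsHandler dirsHandler) {
    this.user = user;
    this.dirsHandler = dirsHandler;
  }

  // Simplified stand-in for the patch's isParent helper (assumption: prefix match).
  static boolean isParent(String path, String parent) {
    String prefix = parent.endsWith("/") ? parent : parent + "/";
    return path.startsWith(prefix);
  }

  // Returns as soon as one local dir matches instead of OR-ing over every dir.
  boolean isUnderLocalDir(String localPath) {
    if (dirsHandler == null) {
      return false; // assumption: no handler means nothing to match against
    }
    for (String dir : dirsHandler.getLocalDirs()) {
      if (isParent(localPath, dir)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    DirsHandler handler = () -> Arrays.asList("/data1/nm-local", "/data2/nm-local");
    TrackerSketch tracker = new TrackerSketch("alice", handler);
    System.out.println(tracker.isUnderLocalDir("/data2/nm-local/filecache/10/job.jar"));
  }
}
```

Returning from inside the loop gives the same result as accumulating with |= and simply skips the remaining directories once a match is found.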
> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch,
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad
> after localisation. The NM keeps the paths of localised resources in memory.
> When a resource is requested, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true, but the file cannot be opened when it is read.
> Note: file.exists() actually calls stat64 natively which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good it should return an array of
> paths with length at least 1.
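
The proposal in the description can be sketched roughly as follows. This is an illustration only, not the patch itself: ResourcePresenceSketch and isParentListable are hypothetical names. It relies on the documented behaviour of java.io.File.list(), which returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

class ResourcePresenceSketch {
  /**
   * Checks the parent directory of a localised resource by listing it,
   * rather than relying on File.exists() for the resource alone.
   */
  static boolean isParentListable(File localisedResource) {
    File parent = localisedResource.getParentFile();
    if (parent == null) {
      return false; // no parent path to list
    }
    // list() returns null if the directory cannot be read.
    String[] entries = parent.list();
    // On a good disk the parent holds at least the resource itself.
    return entries != null && entries.length >= 1;
  }

  public static void main(String[] args) {
    // Hypothetical path, for illustration only.
    File resource = new File("/tmp/nm-local/filecache/10/job.jar");
    System.out.println(isParentListable(resource));
  }
}
```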