[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727698#comment-14727698 ]
Varun Vasudev commented on YARN-3591:
-------------------------------------

Thanks for the latest patch, Lavkesh! A couple of comments:

1. Instead of
{code}
+ this.dirsHandler = dirHandler;
{code}
in the new constructors you added, can you add that line to
{code}
LocalResourcesTrackerImpl(String user, ApplicationId appId,
    Dispatcher dispatcher,
    ConcurrentMap<LocalResourceRequest,LocalizedResource> localrsrc,
    boolean useLocalCacheDirectoryManager, Configuration conf,
    NMStateStoreService stateStore)
{code}
and have the other constructors call this one? Pass null for the directory handler if the existing constructors are called.

2.
{code}
+ ret |= isParent(rsrc.getLocalPath().toUri().getPath(), dir);
{code}
We don't need to iterate through all the local dirs. Once ret is true we can break out of the loop and return.

Rest of the patch looks good.

> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch,
> YARN-3591.6.patch, YARN-3591.7.patch, YARN-3591.8.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad
> after localisation. NM keeps the paths of localised resources in memory. At
> the time of a resource request, isResourcePresent(rsrc) is called, which
> calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good it should return an array of
> paths with length at least 1.
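
For illustration only, the early-exit loop from comment 2 and the file.list() proposal from the description could be sketched roughly as below. This is not the code from any attached patch: the class name LocalizedResourceCheck, the helper isUnderAnyLocalDir, the body of isParent, and the File-based signature of isResourcePresent are all assumptions made up for this sketch. Keeping the exists() call alongside the listing is also an assumption, not something the proposal states.

```java
import java.io.File;
import java.util.List;

/** Hypothetical helper for illustration; not part of the YARN-3591 patch. */
final class LocalizedResourceCheck {

  /** True if path is dir itself or lies underneath it (assumed semantics). */
  static boolean isParent(String path, String dir) {
    String prefix = dir.endsWith(File.separator) ? dir : dir + File.separator;
    return path.equals(dir) || path.startsWith(prefix);
  }

  /** Returns on the first matching local dir instead of OR-ing over all of them. */
  static boolean isUnderAnyLocalDir(String resourcePath, List<String> localDirs) {
    for (String dir : localDirs) {
      if (isParent(resourcePath, dir)) {
        return true; // no need to look at the remaining dirs
      }
    }
    return false;
  }

  /** Lists the parent directory in addition to calling exists() on the resource. */
  static boolean isResourcePresent(File resource) {
    File parent = resource.getParentFile();
    if (parent == null) {
      return false;
    }
    // File.list() returns null if the path is not a directory or an I/O error occurs.
    String[] children = parent.list();
    return children != null && children.length >= 1 && resource.exists();
  }
}
```

Returning from inside the loop makes the first match final, so the |= accumulator is no longer needed.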