[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595250#comment-14595250 ]
zhihai xu commented on YARN-3591:
---------------------------------

Hi [~jlowe], thanks for the thorough analysis.
I had assumed that the files on a bad disk are most likely inaccessible; it looks like that assumption is wrong.
Your first approach looks better, with fewer side effects. Item 5 may be very time-consuming, though.
I can think of the following possible improvements for your first approach:
# Cache all the local directories used by running containers, i.e. the directories of every LocalizedResource with a non-zero refcount. This may speed up item 5. We only need to keep the cached directories that are on a disk which has just been repaired.
# Maybe we can remove the LocalizedResource entries with zero refcount on a bad disk from the map in {{onDirsChanged}}. We should also remove such an entry when handling the {{RELEASE}} ResourceEvent.
# It looks like we still need to store the bad local dirs in the state store, so that during NM recovery we can track which disks have been repaired.

> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> This happens when a resource is localised on a disk and the disk goes bad
> after localisation. NM keeps the paths of localised resources in memory. At
> the time of a resource request, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good it should return an array of
> paths with length at least 1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
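
For illustration, the check proposed in the issue description could look roughly like the sketch below. This is not the NodeManager code and not any of the attached patches; the class name LocalResourceCheck and the method isResourceReadable are hypothetical. It only assumes the documented behaviour of java.io.File.list(), which returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Hypothetical helper, not part of Hadoop: sketches the list()-based check
// described in the issue instead of relying on exists() alone.
final class LocalResourceCheck {

    private LocalResourceCheck() {
    }

    /**
     * Returns true only if the parent directory of the resource can be
     * listed, contains at least one entry, and the resource itself exists.
     */
    static boolean isResourceReadable(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false;
        }
        // list() has to open the directory, so a cached inode alone is not
        // enough. It returns null if the path is not a directory or if an
        // I/O error occurs.
        String[] entries = parent.list();
        if (entries == null || entries.length == 0) {
            return false;
        }
        // Keep the original exists() check as a final guard.
        return resource.exists();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: LocalResourceCheck <path>");
            return;
        }
        System.out.println(isResourceReadable(new File(args[0])));
    }
}
```

How a failing disk actually responds to the directory listing depends on the OS and the failure mode, so this is a sketch of the idea rather than a guaranteed detector.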