[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546568#comment-14546568 ]
Lavkesh Lahngir commented on YARN-3591:
---------------------------------------

[~vinodkv]: The concern here is that if a resource is present in the LocalResourcesTrackerImpl cache (in memory), the NM only checks file.exists(), which returns true even if the disk is not readable. We wanted to remove the resource from this cache and from the state store so that it is found missing when it is next requested and can be downloaded again. This is not a case of localization failure.

[~zxu]: In the other case, when a disk goes bad while it holds resources and other container-related files, will those files ever be deleted once the disk becomes good again? I understand that resources are deleted from disk, least recently used first, when the maximum cache size or the limit on the number of directories is reached.

IMO, if the above cache cleanup (from disk) is acceptable, then we can just call removeResource() instead of remove() when a resource is found on a bad disk, which will remove it from memory and from the state store.

> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch
>
>
> It happens when a resource is localised on a disk and that disk goes bad
> after localisation. The NM keeps paths for localised resources in memory. At
> the time of a resource request, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good, it should return an array of
> paths with length at least 1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
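
For illustration, the check proposed in the issue description could be sketched roughly as below. This is not code from any of the attached patches: the class name LocalizedResourceCheck and both method names are hypothetical, and the extra resource.exists() call inside the stricter check is an assumption added here so that a missing file is still reported as absent. Only java.io.File is used; File.list() returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Hypothetical helper, not part of the NodeManager code base.
final class LocalizedResourceCheck {
    private LocalizedResourceCheck() {}

    // Existence-only check, as described in the issue. It is backed by a
    // stat-style call and may still return true when inode data is cached.
    static boolean existsOnly(File resource) {
        return resource.exists();
    }

    // Stricter check following the proposal: list the parent directory,
    // which has to open the directory rather than only stat the path.
    static boolean isReadablyPresent(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false; // path has no parent component
        }
        String[] entries = parent.list(); // null: not a directory or I/O error
        // Assumption: also require exists() so a missing file is not accepted
        // merely because the parent directory is readable and non-empty.
        return entries != null && entries.length >= 1 && resource.exists();
    }
}
```

A caller that gets false from isReadablyPresent() for a cached path would then treat the resource as missing, so that it can be dropped from the in-memory cache and the state store and downloaded again, as discussed in the comment above.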