[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546568#comment-14546568
 ] 

Lavkesh Lahngir commented on YARN-3591:
---------------------------------------

[~vinodkv]: The concern here is that if a resource is present in the 
LocalResourcesTrackerImpl cache (in memory), the only check made is 
file.exists(), and it returns true even if the disk is not readable. We 
wanted to remove the resource from this cache and from the state store so that 
it is reported missing when it is next requested and can be downloaded again. 
This is not a case of localization failure.
[~zxu]: In the other case, when a disk that holds resources and other 
container-related files goes bad, will those files ever be deleted once the 
disk becomes good again? I understand that the least recently used resources 
are deleted (from disk) when the max cache size or the limit on the number of 
directories is reached.

IMO, if the above cache cleanup (from disk) is acceptable, then we can just 
call removeResource() instead of remove() when a resource is found on a bad 
disk, which will remove it from memory and from the state store.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad 
> after localisation. The NM keeps the paths of localised resources in memory. 
> When a resource is requested, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, inodes are still cached and 
> file.exists() returns true, but at read time the file cannot be opened.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it is able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
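
The check proposed in the description could look roughly like the sketch below. It is only an illustration, not the attached patch: the class ReadableResourceCheck and the method isResourceReadable() are hypothetical names, and the only real API used is java.io.File. It relies on File.list() returning null when the directory cannot be read or an I/O error occurs.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ReadableResourceCheck {

    // Hypothetical helper, not part of the NodeManager code base.
    // Returns true only if the file exists AND its parent directory
    // can actually be listed.
    static boolean isResourceReadable(File resource) {
        if (!resource.exists()) {
            return false;
        }
        File parent = resource.getParentFile();
        if (parent == null) {
            // No parent component in the path; nothing to list.
            return false;
        }
        // File.list() returns null if the path is not a directory
        // or if an I/O error occurs while reading it.
        String[] entries = parent.list();
        // A readable parent must contain at least the resource itself.
        return entries != null && entries.length >= 1;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory stands in for a local-dir on the NM.
        File dir = Files.createTempDirectory("localized").toFile();
        File rsrc = new File(dir, "resource.jar");
        Files.write(rsrc.toPath(), new byte[] {1, 2, 3});

        System.out.println(isResourceReadable(rsrc));

        rsrc.delete();
        System.out.println(isResourceReadable(rsrc));
        dir.delete();
    }
}
```

On a healthy disk this behaves like exists(). The difference only shows when the directory itself can no longer be read, which is the bad-disk case described above.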



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
