zhihai xu commented on YARN-3591:

I think the current code calls {{removeResource}} instead of {{remove}} to 
remove a localized resource that can't be accessed due to a disk error.
We may do the same here, because all containers that use the localized resources 
on a bad disk may fail, so removing these resources early looks reasonable.
But I think we should be careful with disks that are full. It may not be 
good to remove localized resources on full disks, because full disks may 
become good disks again after files are removed by CacheCleanup. Full disks 
need more thought; maybe we can add a new signal for disks becoming bad in 

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch
> It happens when a resource is localised on a disk and that disk goes bad 
> after localisation. NM keeps the paths of localised resources in memory. At the 
> time of a resource request, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1.
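
For illustration only, the check proposed in the description could be sketched roughly as below. This is not the code from any of the attached patches: the class name LocalResourceCheck and the method isResourceReadable are hypothetical, and the sketch assumes the resource is a plain java.io.File rather than the NM's own resource types. It relies only on the documented behaviour that File.list() returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Hypothetical helper, not taken from the YARN-3591 patches.
final class LocalResourceCheck {

    private LocalResourceCheck() {
    }

    /**
     * Returns true only if the parent directory of the resource can be
     * listed and the resource itself is still reported as present.
     */
    static boolean isResourceReadable(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            // No parent in the path, so there is nothing to list.
            return false;
        }
        // File.list() returns null if parent is not a directory
        // or if an I/O error occurs while reading it.
        String[] entries = parent.list();
        if (entries == null || entries.length < 1) {
            return false;
        }
        // Keep the original exists() check as a second condition.
        return resource.exists();
    }
}
```

A caller would use isResourceReadable(file) where it previously relied on file.exists() alone. Whether listing the parent directory detects every kind of disk failure is an assumption that would need testing on real hardware.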

This message was sent by Atlassian JIRA
