Lavkesh Lahngir commented on YARN-3591:

LocalResourcesTrackerImpl keeps a reference count for each resource. 
remove(LocalizedResource req, DeletionService delService)
fails when the reference count is non-zero: in that case it does not remove 
the resource, and there is no later opportunity to remove it unless the 
local dirs change again.
Should we mark such resources as not-usable when we are unable to remove them? 
We would then need to check that a resource is localized and is not marked 
as not-usable before passing it to a new container. 
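
A minimal sketch of the idea, assuming a simplified tracker keyed by a string. ResourceTrackerSketch, Entry, the usable flag and all method names here are hypothetical and only illustrate the proposal; this is not the actual LocalResourcesTrackerImpl code or API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a ref-counted tracker; not the real
// LocalResourcesTrackerImpl API.
class ResourceTrackerSketch {
    private static final class Entry {
        int refCount;          // containers currently using the resource
        boolean usable = true; // assumed "not-usable" marker from the proposal
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Record a freshly localized resource.
    void localize(String key) {
        entries.put(key, new Entry());
    }

    // Hand the resource to a container only if it is localized and usable.
    boolean acquire(String key) {
        Entry e = entries.get(key);
        if (e == null || !e.usable) {
            return false; // caller would have to localize again
        }
        e.refCount++;
        return true;
    }

    // A container finished; drop a not-usable entry once nobody holds it.
    void release(String key) {
        Entry e = entries.get(key);
        if (e == null) return;
        if (e.refCount > 0) e.refCount--;
        if (e.refCount == 0 && !e.usable) entries.remove(key);
    }

    // Returns true if removed now; otherwise marks the entry not-usable.
    boolean remove(String key) {
        Entry e = entries.get(key);
        if (e == null) return false;
        if (e.refCount > 0) {
            e.usable = false;
            return false;
        }
        entries.remove(key);
        return true;
    }

    boolean isTracked(String key) {
        return entries.containsKey(key);
    }
}
```

In this sketch a failed remove() is no longer a dead end: the entry is refused to new containers right away and is dropped when the last holder releases it. Whether the real tracker should clean up on release or elsewhere is left open.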

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch
> It happens when a resource is localised on a disk and that disk goes bad 
> after localisation. NM keeps the paths of localised resources in memory. At the 
> time of a resource request, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1.
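
The proposed check could look roughly like the sketch below. LocalizedResourceCheck and isReadable are hypothetical names, not the NodeManager code or any of the attached patches. It relies only on the documented java.io.File behaviour that list() returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Hypothetical helper illustrating the proposal; not the actual
// isResourcePresent(rsrc) implementation.
class LocalizedResourceCheck {
    static boolean isReadable(File localizedPath) {
        // Cheap check first; per the description this may still return
        // true on a bad disk because inode information is cached.
        if (!localizedPath.exists()) {
            return false;
        }
        File parent = localizedPath.getAbsoluteFile().getParentFile();
        if (parent == null) {
            return false; // no parent directory to list
        }
        // list() returns null on an I/O error or if parent is not a directory.
        String[] entries = parent.list();
        // A healthy parent directory contains at least the resource itself.
        return entries != null && entries.length >= 1;
    }
}
```

Listing the parent costs more than a single exists() call, so whether to run it on every resource request or only as a fallback is a design choice this sketch does not settle.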

This message was sent by Atlassian JIRA
