zhihai xu commented on YARN-3591:

Hi [~vvasudev], thanks for the suggestion.
Your suggestion looks similar to [~lavkesh]'s original patch, 
0001-YARN-3591.patch, but I think it can miss some disk failures. 
LocalDirsHandlerService only calls {{checkDirs}} every 2 minutes by default, so 
if a disk fails right after {{checkDirs}} is called and before 
{{isResourcePresent}} is called, your suggestion won't detect the failure, 
whereas [~lavkesh]'s original patch will. For that reason, [~lavkesh]'s 
original patch looks better to me. That is my understanding; please correct me 
if I am wrong.
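
To make the timing window concrete, here is a toy model of the two approaches. It is only a sketch: {{CheckWindowSketch}}, {{diskReadable}}, {{presentByCachedCheck}} and {{presentByDirectProbe}} are hypothetical names invented for illustration, and none of this is the NodeManager's actual code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: contrasts a result cached by a periodic check with a
// probe made at request time. All names here are hypothetical.
final class CheckWindowSketch {
    // Stand-in for the real state of the disk.
    final AtomicBoolean diskReadable = new AtomicBoolean(true);

    // Result remembered from the most recent periodic check.
    private volatile boolean lastCheckHealthy = true;

    // Models the periodic health check (every 2 minutes by default).
    void checkDirs() {
        lastCheckHealthy = diskReadable.get();
    }

    // Approach A: trust whatever the last periodic check saw.
    boolean presentByCachedCheck() {
        return lastCheckHealthy;
    }

    // Approach B: probe the disk at the moment of the request.
    boolean presentByDirectProbe() {
        return diskReadable.get();
    }
}
```

If the disk fails just after {{checkDirs()}} has run, approach A keeps reporting the resource as present until the next check, while approach B reports the failure on the next request.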

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
> It happens when a resource is localised on a disk and that disk goes bad 
> after localisation. NM keeps the paths of localised resources in memory. At 
> the time of a resource request, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which succeeds because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
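
A minimal sketch of the proposed check follows, assuming a hypothetical helper named {{isReadable}} in a hypothetical class {{ParentDirProbe}}. It is not taken from any of the attached patches; it only illustrates listing the parent directory instead of relying on file.exists().

```java
import java.io.File;

// Hypothetical helper illustrating the proposal; not the NodeManager's code.
final class ParentDirProbe {
    // Returns true only if the parent directory of the resource can be
    // listed and contains at least one entry.
    static boolean isReadable(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false; // no parent directory to list
        }
        // File.list() returns null if the path is not a directory
        // or if an I/O error occurs while reading it.
        String[] entries = parent.list();
        return entries != null && entries.length >= 1;
    }
}
```

Per the description above, the expectation is that on a bad disk the listing fails even when the cached inode still lets file.exists() return true, so the check reports the resource as absent.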

This message was sent by Atlassian JIRA
