[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590484#comment-14590484
 ] 

zhihai xu commented on YARN-3591:
---------------------------------

Hi [~vvasudev], thanks for the explanation.
IMHO, if we want the LocalDirsHandlerService to be the central place for the 
state of the local dirs, doing it in {{DirsChangeListener#onDirsChanged}} will 
be better. IIUC, that is also your suggestion.
The benefits of doing this are:
1. It will give better performance, because you will do it only when some dirs 
become bad, which should happen rarely, so you won't spend time doing it for 
every localization request.
2. It will also help with the issue "What about zombie files lying in the 
various paths" that [~lavkesh] found, which is similar to YARN-2624.
3. {{checkLocalizedResources}}/{{removeResource}} called by {{onDirsChanged}} 
will be done inside {{LocalDirsHandlerService#checkDirs}} without any delay.
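
A rough sketch of the idea, not the actual NodeManager code: the interface and 
tracker below are self-contained stand-ins. {{ResourceTracker}}, its 
{{goodDirs}} supplier, and the {{add}}/{{contains}} helpers are hypothetical 
names made up for illustration. The point is only that the cleanup runs once 
per dirs change rather than once per localization request.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class DirsChangedSketch {
    /** Stand-in for the listener callback; NOT the real Hadoop interface. */
    interface DirsChangeListener {
        void onDirsChanged();
    }

    /** Hypothetical tracker of localized resources, keyed by resource name. */
    static class ResourceTracker implements DirsChangeListener {
        private final Map<String, Path> localized = new ConcurrentHashMap<>();
        private final Supplier<Set<Path>> goodDirs; // assumed source of healthy dirs

        ResourceTracker(Supplier<Set<Path>> goodDirs) {
            this.goodDirs = goodDirs;
        }

        void add(String key, Path localPath) {
            localized.put(key, localPath);
        }

        boolean contains(String key) {
            return localized.containsKey(key);
        }

        /** Runs only when the set of good dirs changes: drop entries on bad dirs. */
        @Override
        public void onDirsChanged() {
            Set<Path> good = goodDirs.get();
            localized.values().removeIf(
                p -> good.stream().noneMatch(d -> p.startsWith(d)));
        }
    }

    public static void main(String[] args) {
        Set<Path> good = new HashSet<>();
        good.add(Paths.get("/d1"));
        good.add(Paths.get("/d2"));

        ResourceTracker tracker = new ResourceTracker(() -> good);
        tracker.add("a", Paths.get("/d1/a.jar"));
        tracker.add("b", Paths.get("/d2/b.jar"));

        good.remove(Paths.get("/d2")); // simulate /d2 going bad
        tracker.onDirsChanged();       // the health check would trigger this

        System.out.println("a tracked: " + tracker.contains("a"));
        System.out.println("b tracked: " + tracker.contains("b"));
    }
}
```

With this shape, a localization request only looks up the tracker and pays no 
per-request disk check.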

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> It happens when a resource is localised on a disk and that disk goes bad 
> after localisation. NM keeps paths for localised resources in memory. At the 
> time of a resource request, isResourcePresent(rsrc) will be called, which 
> calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1.
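
The file.list() proposal in the description above could look roughly like the 
following. This is a minimal sketch using only java.io.File; the class and 
method names ({{LocalizedResourceCheck}}, {{isParentListable}}) are made up for 
illustration and are not taken from any of the attached patches.

```java
import java.io.File;
import java.nio.file.Files;

class LocalizedResourceCheck {
    /**
     * Returns true only if the parent directory of the resource can be listed
     * and the listing has at least one entry. File.list() returns null when
     * the path is not a directory or an I/O error occurs.
     */
    static boolean isParentListable(File resource) {
        File parent = resource.getParentFile();
        if (parent == null) {
            return false; // no parent component to list
        }
        String[] entries = parent.list();
        return entries != null && entries.length >= 1;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("rsrc").toFile();
        File resource = new File(dir, "resource.jar");
        resource.createNewFile();

        File missing = new File(new File(dir, "no-such-dir"), "x.jar");

        System.out.println("existing resource: " + isParentListable(resource));
        System.out.println("missing parent:    " + isParentListable(missing));
    }
}
```

Whether a listing actually fails on a given bad disk depends on the failure 
mode and the OS, so this only illustrates the shape of the check.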



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
