[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592560#comment-14592560
 ] 

zhihai xu commented on YARN-3591:
---------------------------------

Hi [~vvasudev],
bq. can you explain how using onChange will help with the zombie issue?
If a disk goes bad, the files on it may not be deleted correctly until the 
disk becomes good again. Also, in LocalResourcesTrackerImpl.java, after 
{{isResourcePresent}} detects that a LocalizedResource is on a bad disk, 
{{removeResource}} is called to remove it from 
{{LocalResourcesTrackerImpl#localrsrc}} and the NM state store, but it is not 
deleted from the bad disk. These localized files will become zombie files after 
the bad disks are repaired.
The following code from my proposal #4, which is called inside 
{{onDirsChanged}}, may solve this issue:
{code}
for (String localDir : newRepairedDirs) {
  cleanUpLocalDir(lfs, delService, localDir);
}
{code}
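To make the idea concrete, here is a rough, self-contained sketch of how the repaired directories might be computed and cleaned. It is only an illustration: the class name, the {{DirCleaner}} interface and the {{newlyRepaired}} helper are hypothetical stand-ins and not part of the patch, and it assumes a directory counts as repaired when it is in the current good list but was not in the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; none of these names come from the actual NM code.
class RepairedDirsSketch {

    /** Hypothetical stand-in for cleanUpLocalDir(lfs, delService, localDir). */
    interface DirCleaner {
        void cleanUp(String localDir);
    }

    /** Assumption: a dir is "repaired" if it is good now but was not good before. */
    static List<String> newlyRepaired(Set<String> previousGoodDirs,
                                      List<String> currentGoodDirs) {
        List<String> repaired = new ArrayList<>();
        for (String dir : currentGoodDirs) {
            if (!previousGoodDirs.contains(dir)) {
                repaired.add(dir);
            }
        }
        return repaired;
    }

    /** Cleans every newly repaired dir so stale localized files do not linger. */
    static List<String> onDirsChanged(Set<String> previousGoodDirs,
                                      List<String> currentGoodDirs,
                                      DirCleaner cleaner) {
        List<String> newRepairedDirs = newlyRepaired(previousGoodDirs, currentGoodDirs);
        for (String localDir : newRepairedDirs) {
            cleaner.cleanUp(localDir);
        }
        return newRepairedDirs;
    }
}
```

In this sketch only the directories that came back are cleaned, so directories that stayed good throughout are left untouched.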
Please let me know if I am missing something.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> This happens when a resource has been localised on a disk and that disk goes 
> bad afterwards. The NM keeps the paths of localised resources in memory. When 
> a resource is requested, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases when a disk has gone bad, inodes are still cached and 
> file.exists() returns true, but the file cannot be opened when it is read.
> Note: file.exists() calls stat64 natively, which succeeds because it is able 
> to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
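
The check proposed in the description above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from any of the attached patches: the class name is hypothetical, and the method takes a plain java.io.File rather than a LocalizedResource. It relies on the documented behaviour of File.list(), which returns null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

// Illustrative sketch only; the class name and signature are assumptions.
class ResourcePresenceSketch {

    /**
     * Lists the parent directory instead of calling exists() on the file,
     * so that a disk which cannot actually be read is more likely to be noticed.
     */
    static boolean isResourcePresent(File localizedFile) {
        File parent = localizedFile.getParentFile();
        if (parent == null) {
            return false;
        }
        // File.list() returns null if parent is not a directory or on I/O error.
        String[] entries = parent.list();
        return entries != null && entries.length >= 1;
    }
}
```

A stricter variant could also check that the resource's own name appears in the listing; the description only asks for a non-empty result.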



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
