[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

zhihai xu (JIRA) Fri, 15 May 2015 18:01:48 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546461#comment-14546461
 ]


zhihai xu commented on YARN-3591:
---------------------------------

[~vinodkv], yes, keeping the ownership of marking disks good or bad in one 
single place is a very good suggestion, so it is reasonable to keep all the 
disk checking in DirectoryCollection.
Normally the CacheCleanup thread periodically sends a CACHE_CLEANUP event to 
clean up these localized files in LocalResourcesTrackerImpl.
It will be OK if we only remove the localized resources on a "bad" disk that 
can't be recovered. Here a "bad" disk is different from a "full" disk: I 
suppose all the files on a "bad" disk will be lost/deleted by the time it 
becomes good again. Keeping app-level resources sounds reasonable to me.

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad 
> after localisation. The NM keeps the paths of localised resources in memory. 
> When a resource is requested, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, the inodes are still cached and 
> file.exists() returns true, but the file cannot be opened when it is read.
> Note: file.exists() natively calls stat64, which succeeds because it can 
> still find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> calls open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
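
The proposal in the description could be sketched roughly as follows. This is 
only an illustration, not the code in the attached patches: the class 
DiskCheckSketch and the helpers existsByStat and existsByListing are 
hypothetical names, and only the standard java.io.File calls exists() and 
list() are taken from the description.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch; none of these names come from the YARN code base.
final class DiskCheckSketch {

    // The check from the description: metadata only, so it may still
    // return true when the inode is cached but the disk has gone bad.
    static boolean existsByStat(File resource) {
        return resource.exists();
    }

    // The proposed check: list the parent directory, which has to open it.
    // File.list() returns null if the path is not a directory or an I/O
    // error occurs, so null is treated as "not present".
    static boolean existsByListing(File resource) {
        File parent = resource.getAbsoluteFile().getParentFile();
        if (parent == null) {
            return false;
        }
        String[] entries = parent.list();
        return entries != null && entries.length >= 1;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("yarn3591").toFile();
        File resource = new File(dir, "resource.jar");
        resource.createNewFile();

        System.out.println("stat check:    " + existsByStat(resource));
        System.out.println("listing check: " + existsByListing(resource));

        resource.delete();
        dir.delete();
    }
}
```

On a healthy disk both checks agree. The listing check only differs in the 
failure case from the description, where the directory can no longer be 
opened even though the cached inode still satisfies exists().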



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
