[
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635292#comment-14635292
]
Jason Lowe commented on YARN-3591:
----------------------------------
Sorry for the delay, as I was on vacation and am still working through the
backlog. An incremental improvement where we try to avoid using
bad/non-existent resources for future containers but still fail to clean up old
resources on bad disks sounds fine to me. IIUC it fixes some problems we have
today without creating new ones.
> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> It happens when a resource is localised on a disk and that disk subsequently
> goes bad. The NM keeps the paths of localised resources in memory. At the
> time of a resource request, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases, after the disk has gone bad, the inode information is still
> cached, so file.exists() returns true even though the file cannot actually
> be opened for reading.
> Note: file.exists() actually calls stat64 natively, which returns true
> because it is able to find the inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good, it should return an array of
> paths with length at least 1.
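> As a minimal sketch of that proposal (the class name and signature here are
> illustrative only; the real isResourcePresent takes the localised resource
> rsrc rather than a File and resolves its local path):
>
>   import java.io.File;
>
>   class ResourcePresenceCheck {
>     // Sketch of the proposed check: list the parent directory instead
>     // of relying on exists(). File.list() must open the directory
>     // natively, so it returns null on a bad disk, whereas exists()
>     // (stat64) can still report true from cached inode information.
>     static boolean isResourcePresent(File localizedPath) {
>       File parent = localizedPath.getParentFile();
>       if (parent == null) {
>         return false;
>       }
>       String[] children = parent.list(); // null on I/O error / bad disk
>       return children != null && children.length >= 1;
>     }
>   }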
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)