[
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593472#comment-14593472
]
Jason Lowe commented on YARN-3591:
----------------------------------
One potential issue with that approach is long-running containers and
"partially bad" disks. A bad disk can still have readable files on it. If we
blindly remove all the files on a repaired disk then we risk removing the files
from underneath a running container. On UNIX/Linux this may be less of an
issue if the container holds the files open via file descriptors it never
closes, but it would cause problems if the container re-opens the files at
some point or is running on an OS that doesn't reference-count open files
before removing their data.
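As a minimal illustration of the UNIX behaviour mentioned above (plain Java,
not NM code): a file opened before deletion stays readable through the
existing descriptor, but a re-open fails.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class OpenFileAfterDelete {
  public static void main(String[] args) throws IOException {
    Path p = Files.createTempFile("localized-resource", ".bin");
    Files.write(p, "resource data".getBytes());

    // Open first, then delete. On Linux the inode stays alive until the last
    // descriptor closes, so this read still succeeds.
    try (InputStream in = Files.newInputStream(p)) {
      Files.delete(p);
      System.out.println("Read " + in.readAllBytes().length + " bytes after delete");
    }

    // Re-opening after deletion fails, which is the failure mode a
    // long-running container would hit if its files were removed.
    try {
      Files.newInputStream(p);
    } catch (NoSuchFileException e) {
      System.out.println("Re-open failed as expected: " + e);
    }
  }
}
{code}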
This is off the top of my head and is probably not the most efficient solution,
but I think it could work:
# We support mapping LocalResourceRequests to a collection of
LocalizedResources. This allows us to track duplicate localizations (a rough
sketch of this tracking appears after the list).
# When a resource request maps only to LocalizedResource entries that
correspond to bad disks, we make the worst-case assumption that the file is
inaccessible on those bad disks and re-localize it as another LocalizedResource
entry (i.e.: a duplicate).
# When a container completes, we decrement the refcount on the appropriate
LocalizedResources. We're already tracking the references by container ID, so
we can scan the collection to determine which one of the duplicates the
container was referencing.
# When the refcount of a resource on a bad disk goes to zero we don't delete
it (since the disk is probably read-only at that point) and instead just remove
the LocalizedResource entry from the map (or potentially leave it around with a
zero refcount to make the next step a bit cheaper).
# When a disk is repaired, we scan it for any local directory that doesn't
correspond to a LocalizedResource we know about. Those local directories can
be removed, while directories that map to "active" resources are preserved.
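To make the bookkeeping concrete, here is a rough, self-contained sketch of
steps 1-5. All class and method names are invented for illustration; the real
NM types (LocalResourcesTrackerImpl, LocalizedResource, etc.) differ.

{code:java}
import java.util.*;
import java.util.stream.*;

class DuplicateAwareTracker {

  static final class Copy {
    final String localPath;
    boolean onBadDisk;
    final Set<String> containerRefs = new HashSet<>(); // references by container ID
    Copy(String localPath) { this.localPath = localPath; }
  }

  // Step 1: one resource-request key maps to a collection of localized copies.
  private final Map<String, List<Copy>> copies = new HashMap<>();

  void addCopy(String requestKey, String localPath, String containerId) {
    Copy c = new Copy(localPath);
    c.containerRefs.add(containerId);
    copies.computeIfAbsent(requestKey, k -> new ArrayList<>()).add(c);
  }

  // Step 2: relocalize (as a duplicate) only when every known copy is on a bad disk.
  boolean needsRelocalization(String requestKey) {
    List<Copy> l = copies.getOrDefault(requestKey, Collections.emptyList());
    return l.isEmpty() || l.stream().allMatch(c -> c.onBadDisk);
  }

  // Steps 3 and 4: drop the container's reference; a bad-disk copy whose
  // refcount hits zero is forgotten rather than deleted (the disk is likely
  // read-only), while good-disk copies would go through normal deletion.
  void releaseRef(String requestKey, String containerId) {
    List<Copy> l = copies.getOrDefault(requestKey, Collections.emptyList());
    for (Copy c : l) {
      c.containerRefs.remove(containerId);
    }
    l.removeIf(c -> c.onBadDisk && c.containerRefs.isEmpty());
  }

  // Step 5: after a disk is repaired, only directories that still back a
  // tracked copy are preserved; anything else under that disk can be removed.
  Set<String> pathsToPreserve(String diskRoot) {
    return copies.values().stream()
        .flatMap(List::stream)
        .map(c -> c.localPath)
        .filter(p -> p.startsWith(diskRoot))
        .collect(Collectors.toSet());
  }
}
{code}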
One issue with this approach is NM restart. We currently don't track container
references in the state store since we can reconstruct them on startup due to
the assumed one-to-one mapping of LocalResourceRequests to LocalizedResources.
This proposal violates that assumption, so we'd have to start tracking
container references explicitly in the state store to take this approach.
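For illustration only, explicit reference tracking could look something like
the sketch below; the real NMStateStoreService is leveldb-backed and its
schema is different, and the key layout here is invented.

{code:java}
import java.util.*;
import java.util.concurrent.*;

class ResourceRefStore {
  // Simulates the NM's persistent store with an in-memory map.
  private final ConcurrentMap<String, String> db = new ConcurrentHashMap<>();

  void storeRef(String resourcePath, String containerId) {
    db.put("Localization/refs/" + resourcePath + "/" + containerId, "");
  }

  void removeRef(String resourcePath, String containerId) {
    db.remove("Localization/refs/" + resourcePath + "/" + containerId);
  }

  // On recovery, rebuild per-resource reference sets instead of assuming a
  // one-to-one request-to-resource mapping.
  Map<String, Set<String>> recoverRefs() {
    Map<String, Set<String>> refs = new HashMap<>();
    for (String key : db.keySet()) {
      String[] parts = key.split("/");
      String containerId = parts[parts.length - 1];
      String resourcePath = String.join("/",
          Arrays.copyOfRange(parts, 2, parts.length - 1));
      refs.computeIfAbsent(resourcePath, k -> new HashSet<>()).add(containerId);
    }
    return refs;
  }
}
{code}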
A much simpler but harsher approach is to kill any container that is
referencing a resource on a bad disk, in the interest of "failing fast," on
the assumption it will fail or be too slow when accessing the files there
anyway. However in practice I could see many containers having at least one
resource on the bad disk, and that could end up killing most/all of the
containers on a node just because one disk failed. Again, a disk going bad
doesn't necessarily mean all of its data is inaccessible, so we could be
killing containers that otherwise wouldn't know or care about the bad disk
(e.g.: they could have cached the resource in memory before the disk went
bad).
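For what it's worth, the fail-fast policy would amount to something like this
(again just a hedged sketch with invented names, not the NM's actual
container-management API):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Kill every running container that references at least one resource
// localized under a disk that is now marked bad.
class FailFastPolicy {
  // containerId -> local paths of the resources that container references
  private final Map<String, Set<String>> containerResources = new HashMap<>();

  List<String> containersToKill(Set<String> badDiskRoots) {
    List<String> victims = new ArrayList<>();
    for (Map.Entry<String, Set<String>> e : containerResources.entrySet()) {
      boolean touchesBadDisk = e.getValue().stream()
          .anyMatch(path -> badDiskRoots.stream().anyMatch(path::startsWith));
      if (touchesBadDisk) {
        victims.add(e.getKey()); // in practice this can be most of the node
      }
    }
    return victims;
  }
}
{code}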
> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
>
>
> It happens when a resource is localised on a disk and, after localisation,
> that disk goes bad. The NM keeps the paths of localised resources in memory.
> At resource-request time isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true, but at read time the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true
> because it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good it should return an array of
> paths with length at least 1.
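A standalone sketch of that proposed check (the real isResourcePresent() lives
in the NM's resource tracker and differs in detail; this only illustrates the
parent-listing idea):

{code:java}
import java.io.File;

public class ResourcePresenceCheck {
  // Instead of trusting file.exists() (a stat that can succeed from cached
  // inode data on a failed disk), list the parent directory, which forces a
  // real open() and fails when the disk is unreadable.
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] entries = parent.list(); // null if the directory cannot be opened
    if (entries == null || entries.length < 1) {
      return false;
    }
    for (String name : entries) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}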