Jason Lowe commented on YARN-3591:

One potential issue with that approach is long-running containers and 
"partially bad" disks.  A bad disk can still have readable files on it.  If we 
blindly remove all the files on a repaired disk then we risk removing files 
out from underneath a running container.  On UNIX/Linux this may be less of an 
issue if the container is holding the files open via file descriptors, but it 
would cause problems if the container re-opens the files at some point or is 
running on an OS that doesn't reference-count open files before removing the 
data.

This is off the top of my head and is probably not the most efficient solution, 
but I think it could work:
# We support mapping a LocalResourceRequest to a collection of 
LocalizedResource entries.  This allows us to track duplicate localizations 
(a rough sketch of this bookkeeping follows the list).
# When a resource request maps only to LocalizedResource entries that 
correspond to bad disks, we make the worst-case assumption that the file is 
inaccessible on those bad disks and re-localize it as another LocalizedResource 
entry (i.e.: a duplicate).
# When a container completes, we decrement the refcount on the appropriate 
LocalizedResources.  We're already tracking the references by container ID, so 
we can scan the collection to determine which one of the duplicates the 
container was referencing.
# When the refcount of a resource on a bad disk drops to zero we don't delete 
it (since the disk is probably read-only at that point) and instead just remove 
the LocalizedResource entry from the map (or potentially leave it around with a 
zero refcount to make the next step a bit cheaper).
# When a disk is repaired, we scan it for any local directory that doesn't 
correspond to a LocalizedResource we know about.  Those local directories can 
be removed, while directories that map to "active" resources are preserved.
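
To make the shape of that bookkeeping concrete, here's a rough sketch in plain 
Java.  All names here (DuplicateAwareResourceTracker, acquire, release, 
cleanRepairedDir) are invented for illustration; this is not the actual 
LocalResourcesTracker/LocalizedResource code, and it glosses over locking, 
events, and the localizer protocol:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the bookkeeping above.  Names are made up for illustration and
// are not the actual NM classes.
class DuplicateAwareResourceTracker {

  static class LocalizedCopy {
    final String resourceKey;  // the local resource request this copy satisfies
    final String localDir;     // the local-dir (disk root) holding the copy
    final String localPath;    // full path of the localized files
    int refCount;              // containers currently referencing this copy

    LocalizedCopy(String resourceKey, String localDir, String localPath) {
      this.resourceKey = resourceKey;
      this.localDir = localDir;
      this.localPath = localPath;
    }
  }

  // Step 1: a request now maps to a collection of copies instead of exactly one.
  private final Map<String, List<LocalizedCopy>> resources = new HashMap<>();
  private final Set<String> badDirs = new HashSet<>();

  void markDirBad(String localDir) { badDirs.add(localDir); }

  // Step 2: hand out a copy on a good disk, re-localizing a duplicate if every
  // existing copy lives on a disk currently marked bad.
  LocalizedCopy acquire(String resourceKey, String targetDir) {
    List<LocalizedCopy> copies =
        resources.computeIfAbsent(resourceKey, k -> new ArrayList<>());
    for (LocalizedCopy c : copies) {
      if (!badDirs.contains(c.localDir)) {
        c.refCount++;
        return c;                        // reuse an existing good copy
      }
    }
    // Worst case: assume copies on bad disks are unreadable; localize again.
    LocalizedCopy dup =
        new LocalizedCopy(resourceKey, targetDir, targetDir + "/" + resourceKey);
    dup.refCount = 1;
    copies.add(dup);
    return dup;
  }

  // Steps 3 and 4: drop a container's reference; a zero-refcount copy on a bad
  // disk is forgotten rather than deleted (the disk is likely read-only).
  void release(LocalizedCopy copy) {
    copy.refCount--;
    if (copy.refCount == 0 && badDirs.contains(copy.localDir)) {
      resources.get(copy.resourceKey).remove(copy);
    }
  }

  // Step 5: when a repaired disk comes back, remove any directory under it
  // that no longer maps to a copy we are still tracking.
  void cleanRepairedDir(File localDir) {
    badDirs.remove(localDir.getPath());
    Set<String> keep = new HashSet<>();
    for (List<LocalizedCopy> copies : resources.values()) {
      for (LocalizedCopy c : copies) {
        if (c.localDir.equals(localDir.getPath())) {
          keep.add(c.localPath);
        }
      }
    }
    File[] children = localDir.listFiles();
    if (children == null) {
      return;                            // directory still not readable
    }
    for (File child : children) {
      if (!keep.contains(child.getPath())) {
        delete(child);                   // leftover nobody references anymore
      }
    }
  }

  private static void delete(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) {
        delete(child);
      }
    }
    f.delete();
  }
}
{code}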

One issue with this approach is NM restart.  We currently don't track container 
references in the state store since we can reconstruct them on startup thanks 
to the assumed one-to-one mapping of LocalResourceRequests to 
LocalizedResources.  This proposal violates that assumption, so we'd have to 
start tracking container references explicitly in the state store.
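
For illustration, explicit reference tracking could look something like the 
sketch below, with an in-memory map standing in for the leveldb-backed store.  
The key layout and method names are invented and are not the actual 
NMStateStoreService schema:

{code:java}
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: what explicit container -> localized-copy reference
// records could look like.  Key layout and names are invented, not the real
// NM state store schema.
class ContainerRefStore {
  // Stand-in for the leveldb-backed store.
  private final Map<String, String> db = new TreeMap<>();

  // One record per (container, resource) so duplicates can be told apart
  // after an NM restart: value is the path of the specific copy referenced.
  void storeReference(String containerId, String resourceKey, String copyPath) {
    db.put("localizedRef/" + containerId + "/" + resourceKey, copyPath);
  }

  // Container completed: drop all of its reference records.
  void removeReferences(String containerId) {
    db.keySet().removeIf(k -> k.startsWith("localizedRef/" + containerId + "/"));
  }

  // On recovery, rebuild refcounts by replaying the stored references instead
  // of assuming one LocalizedResource per request.
  Map<String, String> loadReferences(String containerId) {
    Map<String, String> refs = new TreeMap<>();
    String prefix = "localizedRef/" + containerId + "/";
    db.forEach((k, v) -> {
      if (k.startsWith(prefix)) {
        refs.put(k.substring(prefix.length()), v);
      }
    });
    return refs;
  }
}
{code}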

A much simpler but harsher approach is to kill containers that are referencing 
resources on bad disks, on the assumption that they will fail or be too slow 
when accessing the files there, in the interest of "failing fast."  However, in 
practice I could see many containers having at least one resource on the bad 
disk, and that could end up killing most or all of the containers on a node 
just because one disk failed.  Again, a disk going bad doesn't necessarily mean 
all of its data is inaccessible, so we could be killing containers that 
otherwise wouldn't know or care about the bad disk (e.g.: they could have 
cached the resource in memory before the disk went bad).

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch
> It happens when a resource is localised on a disk and, after localising, that 
> disk has gone bad.  NM keeps paths for localised resources in memory.  At the 
> time of a resource request, isResourcePresent(rsrc) will be called, which 
> calls file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and 
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good it should return an array of 
> paths with length at least 1.
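
For illustration, the presence check proposed in the description could look 
roughly like the sketch below.  The method name is made up and this is not the 
actual LocalResourcesTrackerImpl code; it just shows the list-the-parent idea:

{code:java}
import java.io.File;

class ResourcePresenceCheck {
  // Instead of relying on file.exists() (a stat that can succeed from cached
  // inode data even on a failed disk), list the parent directory so the check
  // forces an open() against the disk.
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    // list() returns null when the directory cannot actually be opened, e.g.
    // when the disk has gone bad underneath the cached metadata.
    String[] children = parent.list();
    return children != null && children.length >= 1;
  }
}
{code}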
