[
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919295#comment-13919295
]
Sunil G commented on YARN-257:
------------------------------
Maybe the NM can do some level of handling by itself for the disk-full scenario
in the first place.
The NM's LocalDirAllocator returns a local path to write to from the "good" list
of directories.
To pick one, it uses a round-robin algorithm based on the space available.
In a scenario like the one below, if several tasks ask for a path from the set
of local directories, the allocation is done based on the space available at
that moment.
But the same path may already have been given to other tasks, which may still
be writing to it sequentially.
Basically, the space already allotted is not considered when the next
allocation is made for another task from the same path
[assuming a few earlier-allocated tasks are still writing at this time].
However, it is not possible to account for this earlier allotted space, and it
is not possible to predict the disk write speed.
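
To make the timing problem concrete, here is a minimal simulation. It is not
the Hadoop code: StaleSpaceDemo, allocate() and write() are hypothetical names,
and it simply picks the directory with the most free space instead of the real
round-robin logic. It only shows how two allocations made before either task
writes can both land on the same disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StaleSpaceDemo {
    // Hypothetical model: directory name -> free bytes currently reported.
    private final Map<String, Long> freeBytes = new LinkedHashMap<>();

    void addDir(String name, long free) {
        freeBytes.put(name, free);
    }

    /**
     * Picks the directory with the most free space that can hold the request.
     * Only the space free right now is considered; space promised to earlier
     * callers that have not written yet is invisible here.
     */
    String allocate(long requestedBytes) {
        String best = null;
        long bestFree = -1;
        for (Map.Entry<String, Long> e : freeBytes.entrySet()) {
            if (e.getValue() >= requestedBytes && e.getValue() > bestFree) {
                best = e.getKey();
                bestFree = e.getValue();
            }
        }
        return best; // null if no directory fits the request
    }

    /** Simulates a task writing later on; returns false when the disk is full. */
    boolean write(String dir, long bytes) {
        long free = freeBytes.get(dir);
        if (bytes > free) {
            return false;
        }
        freeBytes.put(dir, free - bytes);
        return true;
    }

    public static void main(String[] args) {
        StaleSpaceDemo demo = new StaleSpaceDemo();
        demo.addDir("/disk1", 100);
        demo.addDir("/disk2", 60);
        String first = demo.allocate(80);   // sees 100 free on /disk1
        String second = demo.allocate(80);  // still sees 100 free on /disk1
        System.out.println(first + " " + demo.write(first, 80));
        System.out.println(second + " " + demo.write(second, 80));
    }
}
```

In this model both tasks are handed /disk1, the first write succeeds and the
second one runs out of space, even though each allocation looked fine when it
was made.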
Would it be possible to predict the disk-full scenario rather than acting only
when it happens?
For example, the current health check mechanism checks access permissions etc.
to classify directories as good or bad at a 2-minute interval.
Here, if the disk is almost full (say 95% used, or only 5*100 MB remaining),
then it is better to move that directory to the list of bad directories.
Alternatively, LocalDirAllocator could check for a high percentage of disk
used and not assign such a directory to the task.
These measures might help prevent new tasks from failing because of an
imminent disk-full scenario.
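
A rough sketch of such a check follows, using only java.io.File. The class
name DiskSpaceFilter, its methods and both thresholds are assumptions made for
illustration (the thresholds mirror the 95% / 5*100 MB figures above); this is
not existing NM or LocalDirAllocator code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DiskSpaceFilter {
    // Hypothetical thresholds, taken from the example figures in this comment.
    static final double MAX_USED_FRACTION = 0.95;
    static final long MIN_FREE_BYTES = 5L * 100L * 1024L * 1024L;

    /** Pure check, so the rule can be exercised without touching a real disk. */
    static boolean isNearlyFull(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return true; // unknown or missing partition: treat as unusable
        }
        double usedFraction = 1.0 - ((double) usableBytes / totalBytes);
        return usedFraction >= MAX_USED_FRACTION || usableBytes < MIN_FREE_BYTES;
    }

    /** Returns the subset of the "good" directories that still have headroom. */
    static List<File> dirsWithHeadroom(List<File> goodDirs) {
        List<File> result = new ArrayList<>();
        for (File dir : goodDirs) {
            // getTotalSpace()/getUsableSpace() return 0 if the path is not a partition.
            if (!isNearlyFull(dir.getTotalSpace(), dir.getUsableSpace())) {
                result.add(dir);
            }
        }
        return result;
    }
}
```

The same predicate could be called either from the periodic health check (to
move a directory to the bad list) or at allocation time (to skip the directory
for that request). It does not solve the earlier problem of space that is
already allotted but not yet written; it only leaves some margin for it.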
> NM should gracefully handle a full local disk
> ---------------------------------------------
>
> Key: YARN-257
> URL: https://issues.apache.org/jira/browse/YARN-257
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Jason Lowe
>
> When a local disk becomes full, the node will fail every container launched
> on it because the container is unable to localize. It tries to create an
> app-specific directory for each local and log directory. If any of those
> directory creations fails (due to lack of free space) the container fails.
> It would be nice if the node could continue to launch containers using the
> space available on other disks rather than failing all containers trying to
> launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming
> full rather than the disk failing.