[ 
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919295#comment-13919295
 ] 

Sunil G commented on YARN-257:
------------------------------

Maybe the NM can handle the disk-full scenario to some extent by itself in the 
first place.
NM's LocalDirAllocator returns a local path to write to, chosen from the "good" 
list of directories.
To choose it, it uses a round-robin algorithm based on the space available.

In a scenario like the one described below, if more tasks ask for a path from 
the set of local directories, the allocation is based on the space available 
at that moment.
But the same path may already have been given to other tasks, and they may 
still be writing to it sequentially.

Basically, the space already allotted is not taken into account when the next 
allocation is made for another task from the same path 
(assuming a few of the earlier tasks are still writing at that time).

However, it is not really possible to account for this earlier allotted space, 
and it is not possible to predict the disk write speed.

Could we predict the disk-full scenario rather than acting only once it 
happens?
For example, the current health check mechanism checks access permissions etc. 
every 2 minutes to classify directories as good or bad.
If the space is almost full at that point (say 95% used, or only 5*100 MB 
remaining), it would be better to move that directory to the bad list.
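
A minimal sketch of the space check suggested above, assuming it would run 
alongside the existing permission checks. This is not the NM's actual health 
checker code: the class name, method names, and constants are hypothetical, 
and the thresholds are just the example values from this comment (95% used, or 
less than 5*100 MB free). It uses only java.io.File.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: flags directories that are nearly full.
class DiskSpaceCheck {
    // Example thresholds taken from the comment above (assumptions, not NM defaults).
    static final double MAX_USED_FRACTION = 0.95;
    static final long MIN_FREE_BYTES = 5L * 100L * 1024L * 1024L;

    // True if the disk is at or above the used-space threshold,
    // or has less than the minimum free space remaining.
    static boolean isNearlyFull(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return true; // size unknown or unreadable: treat as unusable
        }
        double usedFraction = 1.0 - (double) usableBytes / totalBytes;
        return usedFraction >= MAX_USED_FRACTION || usableBytes < MIN_FREE_BYTES;
    }

    // Returns only the directories that still have enough room to be "good".
    static List<File> filterGoodDirs(List<File> dirs) {
        List<File> good = new ArrayList<>();
        for (File dir : dirs) {
            if (!isNearlyFull(dir.getTotalSpace(), dir.getUsableSpace())) {
                good.add(dir);
            }
        }
        return good;
    }
}
```

Since the check runs only at the health check interval, a disk can still fill 
up between two runs; the thresholds would need enough headroom for that.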

Alternatively, the LocalDirAllocator could check for a high percentage of disk 
used and not assign such a directory to the task.
These measures might help keep new tasks from failing because of an imminent 
disk-full condition.
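
The allocator-side alternative could look roughly like the sketch below: a 
round-robin pick that skips any directory whose used fraction is at or above a 
threshold. This is an illustration only and not Hadoop's LocalDirAllocator; 
ThresholdAllocator, Dir, and pick are hypothetical names, and the sizes are 
passed in as plain numbers so the logic can be followed without a real disk.

```java
import java.util.List;

// Hypothetical allocator sketch, not Hadoop's LocalDirAllocator.
class ThresholdAllocator {
    // Snapshot of one local directory's capacity (hypothetical type).
    static final class Dir {
        final String path;
        final long totalBytes;
        final long usableBytes;

        Dir(String path, long totalBytes, long usableBytes) {
            this.path = path;
            this.totalBytes = totalBytes;
            this.usableBytes = usableBytes;
        }

        double usedFraction() {
            if (totalBytes <= 0) {
                return 1.0; // unknown size: treat as full
            }
            return 1.0 - (double) usableBytes / totalBytes;
        }
    }

    private final double maxUsedFraction;
    private int next = 0; // round-robin cursor

    ThresholdAllocator(double maxUsedFraction) {
        this.maxUsedFraction = maxUsedFraction;
    }

    // Returns the next directory below the threshold, or null if none qualifies.
    String pick(List<Dir> dirs) {
        for (int i = 0; i < dirs.size(); i++) {
            int idx = (next + i) % dirs.size();
            Dir candidate = dirs.get(idx);
            if (candidate.usedFraction() < maxUsedFraction) {
                next = (idx + 1) % dirs.size();
                return candidate.path;
            }
        }
        return null; // every directory is too full
    }
}
```

As noted above, this still cannot account for space that earlier tasks have 
been promised but have not yet written; it only lowers the chance of handing 
out a directory that is about to fill up.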

> NM should gracefully handle a full local disk
> ---------------------------------------------
>
>                 Key: YARN-257
>                 URL: https://issues.apache.org/jira/browse/YARN-257
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>
> When a local disk becomes full, the node will fail every container launched 
> on it because the container is unable to localize.  It tries to create an 
> app-specific directory for each local and log directories.  If any of those 
> directory creates fail (due to lack of free space) the container fails.
> It would be nice if the node could continue to launch containers using the 
> space available on other disks rather than failing all containers trying to 
> launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming 
> full rather than the disk failing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
