[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

2014-03-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919295#comment-13919295
 ] 

Sunil G commented on YARN-257:
--

Maybe the NM can do some level of handling by itself in the disk-full scenario 
in the first place.
The NM's LocalDirAllocator returns a local path to write to from the "good" 
list of directories.
To choose it, it uses a round-robin algorithm based on the space available.

In a scenario like the one below, if more tasks ask for a path from the set of 
local directories, 
the allocation is done based on the space available at that moment.
But the same path may already have been given to other tasks, which may still 
be writing to it sequentially.

Basically, space that has already been allotted is not considered when the next 
allocation from the same path is given to another task 
(assuming a few of the earlier tasks are still writing at this time).

But it is not possible to account for this earlier allotted space, and it is 
not possible to predict the disk write speed.

Could it be possible to predict the disk-full scenario rather than acting only 
when it happens?
For example, the current health check mechanism checks access permissions etc. 
to identify good and bad directories at a 2-minute interval.
Here, if the disk is almost full (say 95% used, or only 5*100 MB remaining), 
it is better to move that directory to the list of bad directories.
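
A rough sketch of the kind of check meant here, in plain Java. The class name, 
the method names and both thresholds are made up for illustration (the 
thresholds just reuse the figures above, reading 5*100 MB as 500 MB); this is 
not the NM's existing health checker code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the NodeManager API. */
final class DiskSpaceCheck {
    /** Assumed thresholds, taken from the figures in the comment above. */
    static final double MAX_USED_FRACTION = 0.95;
    static final long MIN_FREE_BYTES = 500L * 1024 * 1024; // 5 * 100 MB

    /** Pure decision function, so it can be exercised without a real disk. */
    static boolean isNearlyFull(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return true; // unknown or unreadable volume: treat it as unusable
        }
        double usedFraction = 1.0 - (double) usableBytes / totalBytes;
        return usedFraction >= MAX_USED_FRACTION || usableBytes < MIN_FREE_BYTES;
    }

    /** Keeps only the configured directories that still have headroom. */
    static List<File> goodDirs(List<File> configuredDirs) {
        List<File> good = new ArrayList<>();
        for (File dir : configuredDirs) {
            if (!isNearlyFull(dir.getTotalSpace(), dir.getUsableSpace())) {
                good.add(dir);
            }
        }
        return good;
    }

    public static void main(String[] args) {
        List<File> dirs = new ArrayList<>();
        for (String arg : args) {
            dirs.add(new File(arg));
        }
        System.out.println("good dirs: " + goodDirs(dirs));
    }
}
```

A periodic check could call goodDirs() on the configured local dirs and treat 
everything else as bad until space is freed again.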

Alternatively, the LocalDirAllocator could check for a high percentage of disk 
usage and not assign such a directory to the task.
These measures might help keep new tasks from failing because of an immediate 
disk-full scenario.
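
For the allocator-side variant, something along these lines might work. This 
is only a sketch and not Hadoop's LocalDirAllocator: HeadroomAwarePicker, 
Volume and the cutoff parameter are hypothetical names, and weighting the 
choice by free space is a choice made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Random;

/** Hypothetical sketch; this is not Hadoop's LocalDirAllocator. */
final class HeadroomAwarePicker {
    /** Total and free bytes of the volume backing one local directory. */
    static final class Volume {
        final long totalBytes;
        final long freeBytes;

        Volume(long totalBytes, long freeBytes) {
            this.totalBytes = totalBytes;
            this.freeBytes = freeBytes;
        }

        double usedFraction() {
            return totalBytes <= 0 ? 1.0 : 1.0 - (double) freeBytes / totalBytes;
        }
    }

    private final double maxUsedFraction; // assumed cutoff, e.g. 0.95
    private final Random random;

    HeadroomAwarePicker(double maxUsedFraction, Random random) {
        this.maxUsedFraction = maxUsedFraction;
        this.random = random;
    }

    /**
     * Picks a directory among those below the cutoff, with probability
     * proportional to free space. Returns empty if every directory is too full.
     */
    Optional<String> pick(Map<String, Volume> volumes) {
        List<String> candidates = new ArrayList<>();
        long totalFree = 0;
        for (Map.Entry<String, Volume> entry : volumes.entrySet()) {
            Volume v = entry.getValue();
            if (v.usedFraction() < maxUsedFraction && v.freeBytes > 0) {
                candidates.add(entry.getKey());
                totalFree += v.freeBytes;
            }
        }
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        long target = (long) (random.nextDouble() * totalFree);
        for (String name : candidates) {
            target -= volumes.get(name).freeBytes;
            if (target < 0) {
                return Optional.of(name);
            }
        }
        // Only reachable through floating-point rounding at the upper edge.
        return Optional.of(candidates.get(candidates.size() - 1));
    }

    public static void main(String[] args) {
        Map<String, Volume> volumes = new LinkedHashMap<>();
        volumes.put("/data1", new Volume(1000, 20));  // 98% used, skipped
        volumes.put("/data2", new Volume(1000, 600)); // 40% used, eligible
        HeadroomAwarePicker picker = new HeadroomAwarePicker(0.95, new Random());
        System.out.println(picker.pick(volumes));
    }
}
```

An empty result is the interesting case: the caller would then have to fail 
the request or wait, instead of handing out a path that is about to fill up.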

> NM should gracefully handle a full local disk
> -
>
> Key: YARN-257
> URL: https://issues.apache.org/jira/browse/YARN-257
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Jason Lowe
>
> When a local disk becomes full, the node will fail every container launched 
> on it because the container is unable to localize.  It tries to create an 
> app-specific directory in each local and log directory.  If any of those 
> directory creations fails (due to lack of free space), the container fails.
> It would be nice if the node could continue to launch containers using the 
> space available on other disks rather than failing all containers trying to 
> launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming 
> full rather than the disk failing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

2013-08-28 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752764#comment-13752764
 ] 

Eli Collins commented on YARN-257:
--

This seems like a blocker for GA given that MR1 handles disk failures.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

2012-12-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510842#comment-13510842
 ] 

Jason Lowe commented on YARN-257:
-

bq. Before the complete change, would it help if the NM did not accept new 
containers? Maybe by indicating in the heartbeat that the RM should not assign 
containers to it.

Yes, it would be nice sometimes if a node could declare itself as being 
UNHEALTHY without causing all containers currently running to be shot as it 
does now.  Sort of a "let's drain the currently running containers but not 
allow any new ones" mode.
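
To make the idea concrete, here is a minimal sketch of such a mode. Everything 
in it (DrainableNode, Mode, the method names) is hypothetical and only models 
the behavior; it is not an existing NM or RM state.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a "drain" mode; not an existing YARN node state. */
final class DrainableNode {
    enum Mode { ACCEPTING, DRAINING }

    private Mode mode = Mode.ACCEPTING;
    private final Set<String> running = new HashSet<>();

    /** Starts a container unless the node is draining. */
    boolean tryStart(String containerId) {
        if (mode == Mode.DRAINING) {
            return false; // refuse new work, leave running containers alone
        }
        running.add(containerId);
        return true;
    }

    /** Marks a container as finished, whatever the current mode is. */
    void complete(String containerId) {
        running.remove(containerId);
    }

    /** Stops accepting new containers without killing the current ones. */
    void startDraining() {
        mode = Mode.DRAINING;
    }

    int runningCount() {
        return running.size();
    }

    /** True once draining was requested and the last container has finished. */
    boolean isDrained() {
        return mode == Mode.DRAINING && running.isEmpty();
    }

    public static void main(String[] args) {
        DrainableNode node = new DrainableNode();
        node.tryStart("container_1");
        node.startDraining();
        System.out.println("accepted new container: " + node.tryStart("container_2"));
        node.complete("container_1");
        System.out.println("drained: " + node.isDrained());
    }
}
```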

bq. Why does the RM not notice abnormal failure rates on such an NM and put it 
out of rotation for scheduling?

Currently the RM doesn't track container failures on nodes for purposes of 
blacklisting them.  AFAIK nodes can only be blacklisted by an RM when they 
declare themselves UNHEALTHY via the health checker script that they 
run.  The MR AM is already tracking such things, but I don't believe there's a 
feedback mechanism from the AM to the RM to help the RM figure out which nodes 
are bad from an AM's perspective.  Might be nice to have, and YARN-195 covers 
this to some extent.

As you indicate, the RM could also check container failures solely via container 
status from the NMs and blacklist NMs based on some algorithm.  We need to be 
careful that a misconfigured large job doesn't end up blacklisting a large 
chunk of the cluster because all of its containers fail.  Think bad parameters 
on mapreduce.map.java.opts, for example, or a case where it doesn't get the 
classpath for its tasks correct.  And not all container failures from an AM's 
point of view are visible from the RM watching container status.  The container 
could exit cleanly but still fail at the app-level, for example.  So we might 
need both mechanisms.
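
One way "some algorithm" could guard against the misconfigured-job case is to 
require failures from more than one application before a node is considered 
bad. A sketch of that idea follows; NodeFailureTracker and both thresholds are 
hypothetical, and nothing like this exists in the RM today.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker; the RM has no such class today. */
final class NodeFailureTracker {
    private final int minFailures;     // assumed threshold
    private final int minDistinctApps; // assumed threshold
    private final Map<String, Integer> failuresByNode = new HashMap<>();
    private final Map<String, Set<String>> appsByNode = new HashMap<>();

    NodeFailureTracker(int minFailures, int minDistinctApps) {
        this.minFailures = minFailures;
        this.minDistinctApps = minDistinctApps;
    }

    /** Records one failed container of the given application on the node. */
    void recordFailure(String nodeId, String appId) {
        failuresByNode.merge(nodeId, 1, Integer::sum);
        appsByNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(appId);
    }

    /**
     * A node is a blacklist candidate only if enough containers failed AND the
     * failures come from several applications, so that a single broken job
     * cannot take out the nodes it happens to run on.
     */
    boolean shouldBlacklist(String nodeId) {
        int failures = failuresByNode.getOrDefault(nodeId, 0);
        int apps = appsByNode
                .getOrDefault(nodeId, Collections.<String>emptySet())
                .size();
        return failures >= minFailures && apps >= minDistinctApps;
    }

    public static void main(String[] args) {
        NodeFailureTracker tracker = new NodeFailureTracker(3, 2);
        for (int i = 0; i < 10; i++) {
            tracker.recordFailure("nodeA", "app_1"); // one misconfigured job
        }
        System.out.println("after app_1 only: " + tracker.shouldBlacklist("nodeA"));
        tracker.recordFailure("nodeA", "app_2");
        System.out.println("after app_2 too: " + tracker.shouldBlacklist("nodeA"));
    }
}
```

This only covers failures visible through container status; containers that 
exit cleanly but fail at the app level would still need feedback from the AM.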




[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk

2012-12-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510812#comment-13510812
 ] 

Bikas Saha commented on YARN-257:
-

Before the complete change, would it help if the NM did not accept new 
containers? Maybe by indicating in the heartbeat that the RM should not assign 
containers to it.
Why does the RM not notice abnormal failure rates on such an NM and put it out 
of rotation for scheduling?
