[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
[ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919295#comment-13919295 ]

Sunil G commented on YARN-257:
------------------------------

Maybe the NM can handle the disk-full scenario to some degree by itself in the first place.

The NM's LocalDirAllocator hands out a local path to write to from the "good" list of directories, using a round-robin algorithm based on available space. If several tasks ask for a path from the set of local directories, each allocation is based on the space available at that moment. However, the same path may already have been given to other tasks that are still writing to it sequentially. In other words, space already allotted to earlier tasks is not taken into account when the next allocation is made from the same path (assuming a few of the earlier tasks are still writing at that time). It is not possible to account for this allotted space, nor to predict the disk write speed.

Could we predict the disk-full scenario rather than react once it happens? For example, the current health check mechanism checks access permissions etc. at a 2-minute interval to classify directories as good or bad. If a directory is almost full (say 95% used, or only 5*100 MB remaining), it would be better to move it to the list of bad directories. Alternatively, LocalDirAllocator could check for a high percentage of disk used and not assign such a directory to the task.

These measures might help keep new tasks from failing because of an imminent disk-full condition.
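The threshold check suggested in the comment above could be sketched roughly as follows. This is only an illustration: DiskSpaceFilter, hasEnoughSpace and filterGoodDirs are hypothetical names, not NodeManager or LocalDirAllocator code, and the 95% / 500 MB limits are just the example figures from the comment. The only real API used is java.io.File#getTotalSpace and #getUsableSpace.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the NodeManager code base. */
final class DiskSpaceFilter {
    private final double maxUsedFraction; // e.g. 0.95 for "95% used"
    private final long minFreeBytes;      // e.g. 500 MB remaining

    DiskSpaceFilter(double maxUsedFraction, long minFreeBytes) {
        this.maxUsedFraction = maxUsedFraction;
        this.minFreeBytes = minFreeBytes;
    }

    /** Pure check on raw numbers, so it can be tested without real disks. */
    boolean hasEnoughSpace(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return false; // File#getTotalSpace returns 0 when the path names no partition
        }
        double usedFraction = 1.0 - (double) usableBytes / totalBytes;
        return usedFraction < maxUsedFraction && usableBytes >= minFreeBytes;
    }

    /** Keeps only the directories that still pass both thresholds. */
    List<File> filterGoodDirs(List<File> candidates) {
        List<File> good = new ArrayList<>();
        for (File dir : candidates) {
            if (hasEnoughSpace(dir.getTotalSpace(), dir.getUsableSpace())) {
                good.add(dir);
            }
        }
        return good;
    }
}
```

A check like this only narrows the window. As noted above, tasks that already hold a path keep writing, so a directory that passes now can still fill up before the next check.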
> NM should gracefully handle a full local disk
> ---------------------------------------------
>
> Key: YARN-257
> URL: https://issues.apache.org/jira/browse/YARN-257
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Jason Lowe
>
> When a local disk becomes full, the node will fail every container launched
> on it because the container is unable to localize. It tries to create an
> app-specific directory for each local and log directories. If any of those
> directory creates fail (due to lack of free space) the container fails.
> It would be nice if the node could continue to launch containers using the
> space available on other disks rather than failing all containers trying to
> launch on the node.
> This is somewhat related to YARN-91 but is centered around the disk becoming
> full rather than the disk failing.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
[ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752764#comment-13752764 ]

Eli Collins commented on YARN-257:
----------------------------------

This seems like a blocker for GA given that MR1 handles disk failures.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
[ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510842#comment-13510842 ]

Jason Lowe commented on YARN-257:
---------------------------------

bq. Before the complete change, would it help if the NM did not accept new containers. Maybe by indicating in the heartbeat that do not assign containers to it.

Yes, it would sometimes be nice if a node could declare itself UNHEALTHY without causing all of its currently running containers to be shot, as happens now. Sort of a "let's drain the currently running containers but not allow any new ones" mode.

bq. Why does the RM not notice abnormal failure rates on such an NM and put it out of rotation for scheduling?

Currently the RM doesn't track container failures on nodes for the purpose of blacklisting them. AFAIK a node can only be blacklisted by the RM when it declares itself UNHEALTHY via the health checker script that it runs. The MR AM already tracks such things, but I don't believe there's a feedback mechanism from the AM to the RM to help the RM figure out which nodes are bad from an AM's perspective. That might be nice to have, and YARN-195 covers this to some extent.

As you indicate, the RM could also check container failures solely via container status from the NMs and blacklist NMs based on some algorithm. We need to be careful that a misconfigured large job doesn't end up blacklisting a large chunk of the cluster because all of its containers fail. Think bad parameters on mapreduce.map.java.opts, for example, or a case where it doesn't get the classpath for its tasks correct. And not all container failures from an AM's point of view are visible to the RM watching container status. The container could exit cleanly but still fail at the app level, for example. So we might need both mechanisms.
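To make the "misconfigured large job" concern concrete, here is a minimal sketch of one possible heuristic: only consider a node for blacklisting once containers from several distinct applications have failed on it. The heuristic itself is an assumption made for illustration and is not proposed in this thread. NodeFailureTracker and its methods are hypothetical names and do not correspond to any ResourceManager class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not an actual ResourceManager component. */
final class NodeFailureTracker {
    // Assumed threshold: how many different applications must have seen
    // failures on a node before the node itself is suspected.
    private final int minDistinctApps;
    private final Map<String, Set<String>> failedAppsByNode = new HashMap<>();

    NodeFailureTracker(int minDistinctApps) {
        this.minDistinctApps = minDistinctApps;
    }

    /** Records that a container of the given application failed on the given node. */
    void recordFailure(String nodeId, String appId) {
        failedAppsByNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(appId);
    }

    /**
     * A single job failing everywhere counts as one application per node,
     * so it cannot blacklist a node on its own when minDistinctApps > 1.
     */
    boolean shouldBlacklist(String nodeId) {
        Set<String> apps = failedAppsByNode.get(nodeId);
        return apps != null && apps.size() >= minDistinctApps;
    }
}
```

This covers only failures visible through container status. Containers that exit cleanly but fail at the app level would still need feedback from the AM, as described above.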
[jira] [Commented] (YARN-257) NM should gracefully handle a full local disk
[ https://issues.apache.org/jira/browse/YARN-257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510812#comment-13510812 ]

Bikas Saha commented on YARN-257:
---------------------------------

Before the complete change, would it help if the NM did not accept new containers. Maybe by indicating in the heartbeat that do not assign containers to it.

Why does the RM not notice abnormal failure rates on such an NM and put it out of rotation for scheduling?