[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152307#comment-14152307
 ] 

Jason Lowe commented on YARN-90:
--------------------------------

Thanks for updating the patch, Varun.

bq. I've changed it to "Disk(s) health report: ". My only concern with this is 
that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true then the code should diff the old disk list against the new one, 
logging which disks turned bad using the "Disk(s) failed" line and which disks 
became healthy with some other log message.
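
For illustration, a minimal sketch of that diff (the method name, the set 
arguments and the LOG instance are placeholders, not anything from the patch):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Placeholder sketch: compare the previous failed-dir set with the current one
// so newly failed disks keep the existing "Disk(s) failed" message and
// recovered disks get a separate one.  LOG is the class's existing logger.
void logDiskChanges(Set<String> previousFailed, Set<String> currentFailed) {
  Set<String> newlyFailed = new HashSet<String>(currentFailed);
  newlyFailed.removeAll(previousFailed);
  Set<String> turnedGood = new HashSet<String>(previousFailed);
  turnedGood.removeAll(currentFailed);
  if (!newlyFailed.isEmpty()) {
    LOG.info("Disk(s) failed: " + newlyFailed);
  }
  if (!turnedGood.isEmpty()) {
    LOG.info("Disk(s) turned good: " + turnedGood);
  }
}
{code}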

bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts.  I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there.  This could be part of the health check scan or done separately.  
That way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup.  Sorta like an fsck for the NM state 
on disk.  That is best done as a separate JIRA, as I think this functionality 
is still an incremental improvement without it.
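
To make the idea concrete, a rough sketch of such a scan.  The directory layout 
assumption and the names knownAppIds and scheduleForDeletion are hypothetical; a 
real version would go through the DeletionService and also cover nmPrivate, 
filecache and the log dirs:

{code:java}
import java.io.File;
import java.util.List;
import java.util.Set;

// Rough sketch of the periodic scan described above.  The layout assumption
// (<local-dir>/usercache/<user>/appcache/<appId>) and the helper
// scheduleForDeletion() are placeholders, not code from the patch.
void scanForLeakedAppDirs(List<String> localDirs, Set<String> knownAppIds) {
  for (String localDir : localDirs) {
    File[] userDirs = new File(localDir, "usercache").listFiles();
    if (userDirs == null) {
      continue;
    }
    for (File userDir : userDirs) {
      File[] appDirs = new File(userDir, "appcache").listFiles();
      if (appDirs == null) {
        continue;
      }
      for (File appDir : appDirs) {
        if (!knownAppIds.contains(appDir.getName())) {
          // Leaked entry: schedule just this subtree for deletion instead of
          // wiping the whole disk.
          scheduleForDeletion(appDir);  // placeholder for the actual cleanup hook
        }
      }
    }
  }
}
{code}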

Other comments:

checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.
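
i.e. compute it once and reuse it, along the lines of (variable names made up):

{code:java}
// Call union(errorDirs, fullDirs) once, keep the result, and reuse it in the
// second place that currently recomputes it.
List<String> failedDirs = union(errorDirs, fullDirs);
{code}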

isDiskFreeSpaceOverLimit is now named backwards, as the code returns true if the 
free space is under the limit.
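
Either flip the return value or rename it so the name matches the behavior, 
e.g. something like this (a sketch only, using java.io.File.getUsableSpace and 
a made-up config field):

{code:java}
// Returns true when the disk's free space is below the configured minimum,
// which is what the current code actually checks.
boolean isDiskFreeSpaceUnderLimit(File dir) {
  long freeSpaceMB = dir.getUsableSpace() / (1024 * 1024);
  return freeSpaceMB < minFreeSpacePerDiskMB;  // made-up config field
}
{code}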

getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments 
like the other methods.

Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that.  Also, 
since the function knows how big the ArrayList will be after the operation, it 
should pass that size as a hint to the constructor to avoid reallocations.
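
Something like the following is all that's meant here (sketch):

{code:java}
import java.util.ArrayList;
import java.util.List;

// A concatenation (not a set union -- duplicates are kept), pre-sized so the
// ArrayList never has to grow.
static <T> List<T> concat(List<T> a, List<T> b) {
  List<T> result = new ArrayList<T>(a.size() + b.size());
  result.addAll(a);
  result.addAll(b);
  return result;
}
{code}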


> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
> apache-yarn-90.5.patch, apache-yarn-90.6.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk 
> goes down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), the NodeManager needs a restart. This JIRA is to improve the NodeManager 
> to reuse good disks (which may have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
