[
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183
]
Vinod Kumar Vavilapalli commented on YARN-90:
---------------------------------------------
Thanks for the patch, Song! Some quick comments:
- Because you are changing the semantics of checkDirs(), more changes are
needed (a rough sketch of the new semantics is at the end of this mail).
-- updateDirsAfterFailure() -> updateConfAfterDirListChange()?
-- The log message in updateDirsAfterFailure(): "Disk(s) failed. " should be
changed to something like "Disk-health report changed: ".
- Web UI and web-services are fine for now, I think; nothing to do there.
- Drop the extraneous "System.out.println" lines throughout the patch.
- Let's drop the metrics changes. We need to expose this end-to-end, not just
via metrics: client-side reports, JMX, and metrics. That effort is worth
tracking separately.
- Test:
-- testAutoDir() -> testDisksGoingOnAndOff()?
-- Can you also validate the health-report both when disks go off and when
they come back again? A rough sketch of that shape follows this list.
-- Also, just throw unexpected exceptions instead of catching them and
printing stack traces.
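To make the test suggestion concrete, here is a rough sketch of the shape such
a test could take. This is only an illustration, not code from the patch: the
dirs-handler calls used here (checkDirs(), areDisksHealthy(),
getDisksHealthReport()) are assumptions about the surface this patch touches,
and the disk "failure" is simulated by revoking directory permissions, which
only works when the test does not run as root.

{code:java}
// Illustrative sketch only -- the dirs-handler API here is an assumption.
import static org.junit.Assert.*;

import java.io.File;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;
import org.junit.Test;

public class TestDisksGoingOnAndOff {

  @Test
  public void testDisksGoingOnAndOff() throws Exception { // throw, don't catch
    File testRoot = new File(System.getProperty("java.io.tmpdir"), "disk-test");
    File localDir = new File(testRoot, "nm-local-0");
    File logDir = new File(testRoot, "nm-log-0");
    assertTrue(localDir.mkdirs() || localDir.isDirectory());
    assertTrue(logDir.mkdirs() || logDir.isDirectory());

    YarnConfiguration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.NM_LOCAL_DIRS, localDir.getAbsolutePath());
    conf.set(YarnConfiguration.NM_LOG_DIRS, logDir.getAbsolutePath());
    LocalDirsHandlerService dirsHandler = new LocalDirsHandlerService();
    dirsHandler.init(conf);
    try {
      // Disk goes off: revoke traversal so the health check fails the dir.
      assertTrue(localDir.setExecutable(false, false));
      dirsHandler.checkDirs();
      assertFalse(dirsHandler.areDisksHealthy());
      assertTrue(dirsHandler.getDisksHealthReport()
          .contains(localDir.getPath()));

      // Disk comes back: restore permissions and verify it is good again.
      assertTrue(localDir.setExecutable(true, false));
      dirsHandler.checkDirs();
      assertTrue(dirsHandler.areDisksHealthy());
      assertFalse(dirsHandler.getDisksHealthReport()
          .contains(localDir.getPath()));
    } finally {
      localDir.setExecutable(true, false); // don't leave an unusable temp dir
    }
  }
}
{code}

The two points this is meant to show: the health-report gets validated on both
transitions (off and back on), and the test method declares throws Exception
instead of swallowing failures in a catch block.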
> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ravi Gummadi
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes
> down, it is marked as failed forever. To reuse that disk (after it becomes
> good again), NodeManager needs a restart. This JIRA is to improve NodeManager
> to reuse good disks (which may have been bad some time back).
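As promised above, here is a minimal standalone sketch of the re-check idea the
description calls for: on every periodic health check, re-test the failed dirs
as well as the good ones, and tell the caller when any dir changed state in
either direction. The class and method names are made up for illustration; this
is not the patch's actual code.

{code:java}
// Standalone illustration of the re-check idea; names are hypothetical.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class DirHealthTracker {
  private List<String> goodDirs = new ArrayList<String>();
  private List<String> failedDirs = new ArrayList<String>();

  DirHealthTracker(List<String> configuredDirs) {
    goodDirs.addAll(configuredDirs);
  }

  /**
   * Re-test every configured dir, including previously failed ones, and
   * return true if any dir changed state -- the new checkDirs() semantics.
   */
  boolean checkDirs() {
    List<String> newGood = new ArrayList<String>();
    List<String> newFailed = new ArrayList<String>();
    List<String> all = new ArrayList<String>(goodDirs);
    all.addAll(failedDirs);               // failed dirs get re-tested too
    for (String dir : all) {
      if (isUsable(new File(dir))) {
        newGood.add(dir);
      } else {
        newFailed.add(dir);
      }
    }
    boolean changed = !newGood.equals(goodDirs);
    goodDirs = newGood;
    failedDirs = newFailed;
    return changed;
  }

  private boolean isUsable(File dir) {
    // Simplified stand-in for DiskChecker.checkDir(): the directory must
    // exist and be readable, writable, and traversable.
    return dir.isDirectory() && dir.canRead() && dir.canWrite()
        && dir.canExecute();
  }
}
{code}

With semantics like these, the caller (the method suggested above as
updateConfAfterDirListChange()) would refresh the dir configuration and log
something like "Disk-health report changed: ..." only when checkDirs() returns
true.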