[
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183
]
Vinod Kumar Vavilapalli commented on YARN-90:
---------------------------------------------
Thanks for the patch, Song! Some quick comments:
- Because you are changing the semantics of checkDirs(), more changes are
needed (a rough sketch of the new semantics is at the end of this mail).
-- updateDirsAfterFailure() -> updateConfAfterDirListChange()?
-- The log message in updateDirsAfterFailure(): "Disk(s) failed. " should be
changed to something like "Disk-health report changed: ".
- Web UI and web-services are fine for now, I think; nothing to do there.
- Drop the extraneous "System.out.println" lines throughout the patch.
- Let's drop the metrics changes. We need to expose this end-to-end, not just
via metrics: client-side reports, JMX, and metrics. That effort is worth
tracking separately.
- Test:
-- testAutoDir() -> testDisksGoingOnAndOff()?
-- Can you also validate the health-report both when disks go off and when
they come back again? A rough sketch of that shape follows this list.
-- Also, just throw unexpected exceptions instead of catching them and
printing stack traces.
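To make the test suggestion concrete, here is a rough sketch of the shape such
a test could take. This is only an illustration, not code from the patch: the
dirs-handler calls used here (checkDirs(), areDisksHealthy(),
getDisksHealthReport()) are assumptions about the surface this patch touches,
and the disk "failure" is simulated by revoking directory permissions, which
only works when the test does not run as root.

{code:java}
// Illustrative sketch only -- the dirs-handler API here is an assumption.
import static org.junit.Assert.*;

import java.io.File;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;
import org.junit.Test;

public class TestDisksGoingOnAndOff {

  @Test
  public void testDisksGoingOnAndOff() throws Exception { // throw, don't catch
    File testRoot = new File(System.getProperty("java.io.tmpdir"), "disk-test");
    File localDir = new File(testRoot, "nm-local-0");
    File logDir = new File(testRoot, "nm-log-0");
    assertTrue(localDir.mkdirs() || localDir.isDirectory());
    assertTrue(logDir.mkdirs() || logDir.isDirectory());

    YarnConfiguration conf = new YarnConfiguration();
    conf.set(YarnConfiguration.NM_LOCAL_DIRS, localDir.getAbsolutePath());
    conf.set(YarnConfiguration.NM_LOG_DIRS, logDir.getAbsolutePath());
    LocalDirsHandlerService dirsHandler = new LocalDirsHandlerService();
    dirsHandler.init(conf);
    try {
      // Disk goes off: revoke traversal so the health check fails the dir.
      assertTrue(localDir.setExecutable(false, false));
      dirsHandler.checkDirs();
      assertFalse(dirsHandler.areDisksHealthy());
      assertTrue(dirsHandler.getDisksHealthReport()
          .contains(localDir.getPath()));

      // Disk comes back: restore permissions and verify it is good again.
      assertTrue(localDir.setExecutable(true, false));
      dirsHandler.checkDirs();
      assertTrue(dirsHandler.areDisksHealthy());
      assertFalse(dirsHandler.getDisksHealthReport()
          .contains(localDir.getPath()));
    } finally {
      localDir.setExecutable(true, false); // don't leave an unusable temp dir
    }
  }
}
{code}

The two points this is meant to show: the health-report gets validated on both
transitions (off and back on), and the test method declares throws Exception
instead of swallowing failures in a catch block.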
> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ravi Gummadi
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes
> down, it is marked as failed forever. To reuse that disk (after it becomes
> good again), NodeManager needs a restart. This JIRA is to improve NodeManager
> to reuse good disks (which may have been bad some time back).
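As promised above, here is a minimal standalone sketch of the re-check idea the
description calls for: on every periodic health check, re-test the failed dirs
as well as the good ones, and tell the caller when any dir changed state in
either direction. The class and method names are made up for illustration; this
is not the patch's actual code.

{code:java}
// Standalone illustration of the re-check idea; names are hypothetical.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class DirHealthTracker {
  private List<String> goodDirs = new ArrayList<String>();
  private List<String> failedDirs = new ArrayList<String>();

  DirHealthTracker(List<String> configuredDirs) {
    goodDirs.addAll(configuredDirs);
  }

  /**
   * Re-test every configured dir, including previously failed ones, and
   * return true if any dir changed state -- the new checkDirs() semantics.
   */
  boolean checkDirs() {
    List<String> newGood = new ArrayList<String>();
    List<String> newFailed = new ArrayList<String>();
    List<String> all = new ArrayList<String>(goodDirs);
    all.addAll(failedDirs);               // failed dirs get re-tested too
    for (String dir : all) {
      if (isUsable(new File(dir))) {
        newGood.add(dir);
      } else {
        newFailed.add(dir);
      }
    }
    boolean changed = !newGood.equals(goodDirs);
    goodDirs = newGood;
    failedDirs = newFailed;
    return changed;
  }

  private boolean isUsable(File dir) {
    // Simplified stand-in for DiskChecker.checkDir(): the directory must
    // exist and be readable, writable, and traversable.
    return dir.isDirectory() && dir.canRead() && dir.canWrite()
        && dir.canExecute();
  }
}
{code}

With semantics like these, the caller (the method suggested above as
updateConfAfterDirListChange()) would refresh the dir configuration and log
something like "Disk-health report changed: ..." only when checkDirs() returns
true.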