[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

Varun Vasudev (JIRA) Fri, 17 Oct 2014 08:30:13 -0700

     [ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Varun Vasudev updated YARN-90:
------------------------------
    Attachment: apache-yarn-90.10.patch

Uploaded a new patch to address [~mingma]'s comments.

{quote}
You and Jason discussed the disk cleanup scenario. It would be useful to 
clarify whether the following scenario will be resolved by this jira or 
whether a separate jira is necessary.

1. A disk became read-only, so DiskChecker will mark it as 
DiskErrorCause.OTHER.
2. Later the disk was repaired and became good. There is still data left on 
the disk.
3. Given that this data is from old containers which have finished, who will 
clean it up?
{quote}

Currently this data will not be cleaned up. The admin has to clean it up 
manually. Jason's proposal was to add new functionality that would clean up 
these directories periodically and to tackle that as part of a separate JIRA.

bq. Nit: disksTurnedBad's parameter is named preCheckDirs; it would be better 
to name it preFailedDirs.

Fixed.

bq. In getDisksHealthReport, people can't tell whether a disk failed because 
it is full or because it is faulty; it might be useful to distinguish the two cases.

When a disk fails, we log a message with the reason for the failure as part of 
the checkDirs function in DirectoryCollection; the disk health report just 
reports the numbers.
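
As a rough illustration of that split (reasons go to the log, counts go to the report), here is a minimal sketch. It is not the code from the patch: the class DirHealthSketch, its Cause enum (including the DISK_FULL value), the pluggable checker function and the report wording are all hypothetical names made up for this example. It also shows a previously failed directory being picked up again once it passes the check.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; this is not the DirectoryCollection class from the patch.
class DirHealthSketch {
    // Loosely modeled on the DiskErrorCause mentioned above; DISK_FULL is an assumed value.
    enum Cause { DISK_FULL, OTHER }

    private final List<String> goodDirs = new ArrayList<>();
    private final Map<String, Cause> failedDirs = new LinkedHashMap<>();
    // Stand-in for the real disk check: returns null if the directory is usable,
    // otherwise the cause of the failure.
    private final Function<String, Cause> checker;

    DirHealthSketch(List<String> dirs, Function<String, Cause> checker) {
        this.goodDirs.addAll(dirs);
        this.checker = checker;
    }

    // Re-checks good and previously failed directories alike.
    // Returns true if any directory changed state.
    boolean checkDirs() {
        List<String> all = new ArrayList<>(goodDirs);
        all.addAll(failedDirs.keySet());
        boolean changed = false;
        for (String dir : all) {
            Cause cause = checker.apply(dir);
            boolean wasGood = goodDirs.contains(dir);
            if (cause != null) {
                if (wasGood) {
                    goodDirs.remove(dir);
                    changed = true;
                    System.err.println("Directory " + dir + " failed: " + cause);
                }
                failedDirs.put(dir, cause); // record or refresh the cause
            } else if (!wasGood) {
                failedDirs.remove(dir);
                goodDirs.add(dir);
                changed = true;
                System.err.println("Directory " + dir + " passed the check again");
            }
        }
        return changed;
    }

    List<String> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    // The report carries only counts; the reasons go to the log in checkDirs.
    String healthReport() {
        int total = goodDirs.size() + failedDirs.size();
        return failedDirs.size() + "/" + total + " directories failed";
    }
}
```

In this sketch a caller would invoke checkDirs() periodically; a directory that was read-only on one pass and writable on the next moves back to the good list without a restart, and any data left on it is untouched.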

bq. Is verifyDirUsingMkdir necessary, given that DiskChecker.checkDir will 
check it?

Fixed.

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager so 
> that it reuses good disks (which could have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
