[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

Varun Vasudev (JIRA) Wed, 01 Oct 2014 15:29:59 -0700

     [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Varun Vasudev updated YARN-90:
------------------------------
    Attachment: apache-yarn-90.8.patch

Thanks for the review [~mingma]!

{quote}
1. What if a dir is transitioned from DISK_FULL state to OTHER state? 
DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs 
properly. We can use some state machine for each dir and make sure each 
transition is covered.
{quote}

Fixed. I've rewritten the checkDirs function, but I haven't used a state 
machine. Can you please review?
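
For reference, a minimal sketch of one way to cover every transition without a full state machine. It is not the code in the attached patch. The class DirStateTracker, its methods, and the GOOD state name are hypothetical; only DISK_FULL and OTHER come from the review comment. The idea is to keep a single state per dir and derive the full and error lists from it, so a dir can never sit in two lists after a transition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the implementation in apache-yarn-90.8.patch. */
class DirStateTracker {
  /** DISK_FULL and OTHER are named in the review; GOOD is an assumed name. */
  enum DirState { GOOD, DISK_FULL, OTHER }

  // One entry per dir, so a dir always has exactly one current state.
  private final Map<String, DirState> states = new LinkedHashMap<>();

  /** Record the latest check result, replacing any earlier state for the dir. */
  void update(String dir, DirState latest) {
    states.put(dir, latest);
  }

  /** Derive the list of dirs currently in the given state. */
  List<String> dirsIn(DirState wanted) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, DirState> entry : states.entrySet()) {
      if (entry.getValue() == wanted) {
        result.add(entry.getKey());
      }
    }
    return result;
  }
}
```

With this layout, a dir that moves from DISK_FULL to OTHER drops out of the full list automatically, because both lists are computed from the same map.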

{quote}
2. DISK_FULL state is counted toward the error disk threshold by 
LocalDirsHandlerService.areDisksHealthy; later RM could mark NM NODE_UNUSABLE. 
If we believe DISK_FULL is mostly a temporary issue, should we consider disks 
healthy if they only stay in DISK_FULL for a short period of time?
{quote}

The issue here is that if a disk is full, we can't launch new containers on it. 
If we can't launch containers, the RM should consider the node unhealthy. 
Once the disk is cleaned up, the RM will assign containers to the node again.
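
A rough sketch of that reasoning, assuming the health check compares the fraction of usable dirs against a minimum. The class DiskHealthSketch, the method signature, and the minHealthyFraction parameter are hypothetical and may not match LocalDirsHandlerService.areDisksHealthy.

```java
/** Hypothetical sketch; the real areDisksHealthy may be structured differently. */
class DiskHealthSketch {
  /**
   * Full dirs count against the node just like errored dirs, because
   * no new container can be launched on either kind.
   */
  static boolean areDisksHealthy(int goodDirs, int fullDirs, int errorDirs,
      float minHealthyFraction) {
    int total = goodDirs + fullDirs + errorDirs;
    if (total == 0) {
      return false; // nothing configured, so nothing can be launched
    }
    return (float) goodDirs / total >= minHealthyFraction;
  }
}
```

Once a full dir is cleaned up and counted as good again, the same check passes without any restart.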

{quote}
3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new 
Path\[localAppLogDirs.size()]).". It seems the (Path[]) cast isn't necessary.
{quote}

Fixed.
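
To illustrate why the cast is redundant: the generic toArray(T[]) overload already returns T[]. The sketch below uses String in place of Hadoop's Path so that it compiles on its own; the class and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch using String instead of org.apache.hadoop.fs.Path. */
class ToArrayNoCast {
  static String[] toStringArray(List<String> dirs) {
    // toArray(T[]) is generic and returns T[], so no (String[]) cast is needed.
    return dirs.toArray(new String[dirs.size()]);
  }
}
```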

{quote}
4. What is the intention of numFailures? Method getNumFailures isn't used.
{quote}

This is a carry-over function; it was already part of the existing 
implementation.

{quote}
5. Nit: It is better to expand "import java.util.*;" in 
DirectoryCollection.java and LocalDirsHandlerService.java.
{quote}

Fixed.

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
> apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, 
> apache-yarn-90.8.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
> reuse good disks (which may have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
