Ming Ma commented on YARN-90:

Thanks, Varun, Jason. A couple of comments:

1. What if a dir transitions from the DISK_FULL state to the OTHER state? 
DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs 
properly. We could use a state machine for each dir and make sure every 
transition is covered.
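To illustrate the idea, here is a minimal sketch of such a per-dir state machine. The class, enum, and method names below are hypothetical, not the actual DirectoryCollection API; the point is that updating both sets on every transition means a dir leaving DISK_FULL for an error state can't be left behind in fullDirs.

```java
import java.util.*;

public class DirStateMachine {
    // Hypothetical states; the real patch may use different names.
    enum DiskState { NORMAL, DISK_FULL, OTHER }

    private final Map<String, DiskState> states = new HashMap<>();
    final Set<String> fullDirs = new HashSet<>();
    final Set<String> errorDirs = new HashSet<>();

    void transition(String dir, DiskState next) {
        DiskState prev = states.put(dir, next);
        // Remove the dir from the set belonging to its previous state...
        if (prev == DiskState.DISK_FULL) fullDirs.remove(dir);
        if (prev == DiskState.OTHER) errorDirs.remove(dir);
        // ...and add it to the set belonging to its new state.
        if (next == DiskState.DISK_FULL) fullDirs.add(dir);
        if (next == DiskState.OTHER) errorDirs.add(dir);
    }

    public static void main(String[] args) {
        DirStateMachine sm = new DirStateMachine();
        sm.transition("/local/1", DiskState.DISK_FULL);
        sm.transition("/local/1", DiskState.OTHER); // DISK_FULL -> OTHER
        System.out.println(sm.fullDirs.contains("/local/1"));  // false
        System.out.println(sm.errorDirs.contains("/local/1")); // true
    }
}
```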

2. The DISK_FULL state is counted toward the error disk threshold by 
LocalDirsHandlerService.areDisksHealthy; later the RM could mark the NM 
NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we 
consider the disks healthy as long as they stay in DISK_FULL only for a short 
period of time?
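One possible shape for that grace period, sketched below with hypothetical names (this is not the YARN code, just an illustration of the suggestion): record when a dir first went full, and only count it as failed once it has stayed full past a configurable grace period.

```java
import java.util.*;

public class DiskFullGrace {
    private final long gracePeriodMs;
    // When each dir first entered DISK_FULL, in millis.
    private final Map<String, Long> fullSince = new HashMap<>();

    DiskFullGrace(long gracePeriodMs) { this.gracePeriodMs = gracePeriodMs; }

    void markFull(String dir, long nowMs) { fullSince.putIfAbsent(dir, nowMs); }
    void markGood(String dir) { fullSince.remove(dir); }

    // Only dirs full longer than the grace period count toward the
    // failed-disk threshold; recently-full dirs are still "healthy".
    int countFailed(long nowMs) {
        int failed = 0;
        for (long since : fullSince.values()) {
            if (nowMs - since > gracePeriodMs) failed++;
        }
        return failed;
    }

    public static void main(String[] args) {
        DiskFullGrace g = new DiskFullGrace(5 * 60 * 1000L); // 5-minute grace
        g.markFull("/local/1", 0L);
        System.out.println(g.countFailed(60 * 1000L));      // 0: within grace
        System.out.println(g.countFailed(10 * 60 * 1000L)); // 1: grace expired
    }
}
```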

3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new 
Path[localAppLogDirs.size()])". It seems the (Path[]) cast isn't necessary.
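The reason the cast is redundant: Collection.toArray(T[]) is generic and already returns an array of the argument's type. A minimal illustration using String instead of Hadoop's Path:

```java
import java.util.*;

public class ToArrayDemo {
    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("/log/1", "/log/2");
        // toArray(T[]) already returns String[] here;
        // a (String[]) cast would be redundant.
        String[] arr = dirs.toArray(new String[dirs.size()]);
        System.out.println(arr.length); // 2
    }
}
```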

4. What is the intention of numFailures? The getNumFailures method isn't used.

5. Nit: it would be better to expand "import java.util.*;" into explicit 
imports in DirectoryCollection.java and LocalDirsHandlerService.java.

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
> apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).

This message was sent by Atlassian JIRA
