[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

Varun Vasudev (JIRA) Wed, 12 Mar 2014 13:45:09 -0700

     [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Varun Vasudev updated YARN-90:
------------------------------

    Attachment: apache-yarn-90.1.patch

Uploaded new patch.
{quote}
    DirectoryCollection: can you put the block where you create and delete a 
random directory inside a dir.exists() check? We don't want to create-delete a 
directory that already exists but matches with our random string - very 
unlikely but not impossible.
{quote}
Fixed. The dir check is now its own function with the exists check.
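
For readers following along, a minimal sketch of that check. The class and method names (DirProbe, verifyDirUsingMkdir) and the "probe-" prefix are hypothetical and chosen for illustration; this is not the patch code itself.

```java
import java.io.File;
import java.io.IOException;
import java.util.UUID;

final class DirProbe {
    /**
     * Checks that dir is usable by creating and then deleting a randomly
     * named subdirectory. A candidate name that already exists is skipped,
     * so a pre-existing directory is never created over or deleted.
     */
    static void verifyDirUsingMkdir(File dir) throws IOException {
        File probe = new File(dir, "probe-" + UUID.randomUUID());
        // Very unlikely, but pick another name if the random one is taken.
        while (probe.exists()) {
            probe = new File(dir, "probe-" + UUID.randomUUID());
        }
        if (!probe.mkdir()) {
            throw new IOException("Cannot create directory under " + dir);
        }
        if (!probe.delete()) {
            throw new IOException("Cannot delete " + probe);
        }
    }
}
```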

{quote}
    ResourceLocalizationService (RLS): What happens to disks that become good 
after service-init? We don't create the top level directories there. Depending 
on our assumptions in the code in the remaining NM subsystem, this may or may 
not lead to bad bugs. Should we permanently exclude bad-disks found during 
initializing?
    Similarly, in RLS service-init we call cleanUpLocalDir() to delete old 
files. If disks become good again, we will have unclean disks. And depending on 
our assumptions, we may or may not run into issues. For example, files 'leaked' 
like that may never get deleted.
{quote}
Fixed. Local and log dirs undergo a check before use to ensure that they have 
been set up correctly.
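
A rough sketch of what such a pre-use check could look like, assuming the top-level subdirectories are the three named in this thread (usercache, filecache, nmPrivate). The class and method names are hypothetical, and the real code also has to deal with permissions and cleanup of stale files, which this omits.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

final class LocalDirSetup {
    // Subdirectory names taken from this thread; treat the list as an assumption.
    static final String[] SUBDIRS = {"usercache", "filecache", "nmPrivate"};

    /**
     * Ensures every local dir has its top-level subdirectories, creating
     * any that are missing. Safe to call repeatedly, e.g. each time a disk
     * that was bad at service-init is reported good again.
     */
    static void ensureInitialized(List<File> localDirs) throws IOException {
        for (File root : localDirs) {
            for (String name : SUBDIRS) {
                File sub = new File(root, name);
                if (!sub.isDirectory() && !sub.mkdirs()) {
                    throw new IOException("Cannot create " + sub);
                }
            }
        }
    }
}
```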

{quote}
    Add comments to all the tests describing what is being tested
{quote}
Fixed

{quote}
    Add more inline comments for each test-block, e.g. "changing a disk to be 
bad" before a block where you change permissions. For readability.
{quote}
Fixed

{quote}
    In all the tests where you sleep for a time more than disk-checker 
frequency, it may or may not pass the test depending on the underlying thread 
scheduling. Instead of that, you should explicitly call 
LocalDirsHandlerService.checkDirs()
{quote}
Fixed. The tests now use mocks of LocalDirsHandlerService, which removes the 
timing issue.
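
To illustrate the idea without a mocking library, here is a hand-written stub. DirsHandler is a hypothetical stand-in for the part of LocalDirsHandlerService a test would depend on, not the real class, and the patch itself uses mocks rather than this stub. The point is that a test flips disk state directly instead of sleeping past the disk-checker interval.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the part of LocalDirsHandlerService under test.
interface DirsHandler {
    List<String> getLocalDirs();
}

final class StubDirsHandler implements DirsHandler {
    private final List<String> good = new ArrayList<>();

    StubDirsHandler(List<String> initial) {
        good.addAll(initial);
    }

    // Simulates a disk failing; takes effect immediately, no timer involved.
    void markBad(String dir) {
        good.remove(dir);
    }

    // Simulates a failed disk becoming good again.
    void markGood(String dir) {
        if (!good.contains(dir)) {
            good.add(dir);
        }
    }

    @Override
    public List<String> getLocalDirs() {
        return new ArrayList<>(good); // defensive copy
    }
}
```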

{quote}
    TestResourceLocalizationService.testFailedDirsResourceRelease()
        Nonstandard formatting in method declaration
        There is a bit of code about creating container-dirs. Can we reuse some 
of it from ContainerLocalizer?
{quote}
Fixed the non-standard formatting. The ContainerLocalizer code creates only the 
usercache (we need the filecache and the nmPrivate dirs as well).

{quote}
    TestNonAggregatingLogHandler
        In the existing test-case, you have "actually create the dirs". Why is 
that needed?
{quote}
Fixed. Used mocking to remove the requirement.

{quote}
        Can we reuse any code in this test with what exists in 
TestLogAggregationService? Seems to me that they both should mostly be the same.
{quote}
Fixed. Shared code moved into functions.

{quote}
    TestDirectoryCollection.testFailedDirPassingCheck -> 
testFailedDisksBecomingGoodAgain
{quote}
Fixed.



> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs a restart. This JIRA is to improve NodeManager to 
> reuse good disks (which may have been bad some time back).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
