[
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Varun Vasudev updated YARN-90:
------------------------------
Attachment: apache-yarn-90.1.patch
Uploaded new patch.
{quote}
DirectoryCollection: can you put the block where you create and delete a
random directory inside a dir.exists() check? We don't want to create-delete a
directory that already exists but matches our random string - very
unlikely but not impossible.
{quote}
Fixed. The directory check is now in its own function, which includes the exists() check.
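For reference, the guarded check looks roughly like this (a minimal sketch with made-up names and retry limit, not the patch code verbatim):
{code}
// Only create/delete the probe directory if nothing with that name exists.
import java.io.File;
import java.util.UUID;

class DirWritableCheckSketch {
  /** Returns true if a fresh subdirectory can be created and removed in dir. */
  static boolean verifyDirUsingMkdir(File dir) {
    File probe;
    int attempts = 0;
    do {
      // Pick a random name; retry if it happens to collide with an existing
      // directory so we never create-delete a directory we do not own.
      probe = new File(dir, UUID.randomUUID().toString());
      attempts++;
    } while (probe.exists() && attempts < 100);
    if (probe.exists()) {
      return false;
    }
    boolean created = probe.mkdir();
    boolean deleted = created && probe.delete();
    return created && deleted;
  }
}
{code}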
{quote}
ResourceLocalizationService (RLS): What happens to disks that become good
after service-init? We don't create the top-level directories there. Depending
on our assumptions in the code in the rest of the NM subsystems, this may or may
not lead to bad bugs. Should we permanently exclude bad disks found during
initialization?
Similarly in RLS service-init, we call cleanUpLocalDir() to delete old files. If
disks become good again, we will have unclean disks, and depending on our
assumptions, we may or may not run into issues. For example, files 'leaked' like
that may never get deleted.
{quote}
Fixed. Local and log dirs now undergo a check before use to ensure that they
have been set up correctly.
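The before-use check is along these lines (a simplified sketch assuming the standard NM layout of usercache/filecache/nmPrivate; the helper name and the use of FileContext here are illustrative, not the exact patch code):
{code}
// Make sure a (possibly newly recovered) local dir has the expected NM
// top-level subdirectories, creating any that are missing.
import java.io.IOException;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

class LocalDirSetupSketch {
  private static final String[] NM_SUB_DIRS = { "usercache", "filecache", "nmPrivate" };

  static void ensureLocalDirInitialized(FileContext lfs, String localDir)
      throws IOException {
    for (String sub : NM_SUB_DIRS) {
      Path p = new Path(localDir, sub);
      if (!lfs.util().exists(p)) {
        // Recreate the top-level dir that was never set up (or was wiped)
        // while the disk was marked bad.
        lfs.mkdir(p, new FsPermission((short) 0755), true);
      }
    }
  }
}
{code}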
{quote}
Add comments to all the tests describing what is being tested
{quote}
Fixed
{quote}
Add more inline comments for each test block, e.g. "changing a disk
to be bad" before a block where you change permissions. For readability.
{quote}
Fixed
{quote}
In all the tests where you sleep for longer than the disk-checker
frequency, the test may or may not pass depending on the underlying thread
scheduling. Instead of that, you should explicitly call
LocalDirsHandlerService.checkDirs().
{quote}
Fixed. Used mocks of LocalDirsHandlerService, removing the timing issue.
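Roughly, the tests now do something like this instead of sleeping (a sketch of the mocking approach; the helper names are made up for the example):
{code}
// The test controls the "good dirs" list directly instead of waiting for the
// disk-checker thread.
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;

class DirsHandlerMockSketch {
  static LocalDirsHandlerService mockDirsHandler(String goodDir) {
    LocalDirsHandlerService dirsHandler = mock(LocalDirsHandlerService.class);
    // Initially report the directory as good.
    when(dirsHandler.getLocalDirs()).thenReturn(Arrays.asList(goodDir));
    return dirsHandler;
  }

  static void simulateDiskFailure(LocalDirsHandlerService dirsHandler) {
    // Flip the mock so subsequent calls see no good dirs; no sleeping, no
    // dependency on thread scheduling.
    when(dirsHandler.getLocalDirs()).thenReturn(Collections.<String>emptyList());
  }
}
{code}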
{quote}
TestResourceLocalizationService.testFailedDirsResourceRelease()
Nonstandard formatting in method declaration
There is a bit of code about creating container-dirs. Can we reuse some
of it from ContainerLocalizer?
{quote}
Fixed the non-standard formatting. The ContainerLocalizer code creates only the
usercache (we need the filecache and the nmPrivate dirs as well).
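For context, the test needs the full per-container layout, along these lines (a sketch assuming the usual usercache/<user>/appcache/<app>/<container> layout; the helper name is made up):
{code}
// Per local dir the test needs filecache and nmPrivate in addition to the
// usercache tree that ContainerLocalizer sets up.
import java.io.File;

class ContainerDirsSketch {
  static void createContainerDirs(File localDir, String user, String appId,
      String containerId) {
    new File(localDir, "filecache").mkdirs();
    new File(localDir, "nmPrivate").mkdirs();
    File appDir = new File(new File(new File(localDir, "usercache"), user),
        "appcache" + File.separator + appId);
    new File(appDir, containerId).mkdirs();
  }
}
{code}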
{quote}
TestNonAggregatingLogHandler
In the existing test-case, you have "actually create the dirs". Why is
that needed?
{quote}
Fixed. Used mocking to remove that requirement.
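The idea, roughly (a sketch assuming Mockito and the DeletionService delete(String, Path, Path...) API; class and method names here are illustrative): verify the deletion request instead of creating real dirs and checking the disk afterwards.
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.server.nodemanager.DeletionService;

class LogHandlerMockSketch {
  static DeletionService mockDeletionService() {
    return mock(DeletionService.class);
  }

  static void verifyLogDirsScheduledForDeletion(DeletionService delService,
      String user, Path... appLogDirs) {
    // The handler under test should have handed the app log dirs to the
    // DeletionService; the dirs themselves never need to exist.
    verify(delService).delete(user, (Path) null, appLogDirs);
  }
}
{code}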
{quote}
Can we reuse any code between this test and what exists in
TestLogAggregationService? It seems to me that they should mostly be the same.
{quote}
Fixed. Moved the shared code into common helper functions.
{quote}
TestDirectoryCollection.testFailedDirPassingCheck ->
testFailedDisksBecomingGoodAgain
{quote}
Fixed.
> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Ravi Gummadi
> Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch,
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch
>
>
> MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes
> down, it is marked as failed forever. To reuse that disk (after it becomes
> good again), the NodeManager needs a restart. This JIRA is to improve the NodeManager
> to reuse good disks (which may have been bad some time back).
--
This message was sent by Atlassian JIRA
(v6.2#6252)