[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-17 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.10.patch

Uploaded a new patch to address [~mingma]'s comments.

{quote}
You and Jason discussed the disk clean-up scenario. It would be useful to 
clarify whether the following scenario will be resolved by this jira or whether 
a separate jira is necessary.

1. A disk became read-only, so DiskChecker marked it as DiskErrorCause.OTHER.
2. Later the disk was repaired and became good. There is still data left on 
the disk.
3. Given this data is from old containers which have finished, who will clean 
it up?
{quote}

Currently this data will not be cleaned up. The admin has to clean it up 
manually. Jason's proposal was to add new functionality that would clean up 
these directories periodically and to tackle that as part of a separate JIRA.

bq. Nit: disksTurnedBad's parameter preCheckDirs would be better named 
preFailedDirs.

Fixed.

bq. In getDisksHealthReport, people can't tell whether a disk failed because it 
was full or because of an error; it might be useful to distinguish the two cases.

When a disk fails, we log a message with the reason for the failure as part of 
the checkDirs function in DirectoryCollection; the disk health report just 
reports the numbers.

bq. verifyDirUsingMkdir - is it necessary, given that DiskChecker.checkDir will 
check it?

Fixed.

 NodeManager should identify failed disks becoming good back again
 -

 Key: YARN-90
 URL: https://issues.apache.org/jira/browse/YARN-90
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ravi Gummadi
Assignee: Varun Vasudev
 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
 YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
 apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
 apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
 apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch


 MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
 down, it is marked as failed forever. To reuse that disk (after it becomes 
 good), the NodeManager needs a restart. This JIRA is to improve the NodeManager 
 to reuse good disks (which could have been bad some time back).





[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-14 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.9.patch

Uploaded a new patch to address comments by [~mingma] and [~zxu].

bq. Nit: For Set<String> postCheckFullDirs = new HashSet<String>(fullDirs); - 
it doesn't have to create postCheckFullDirs. It can directly refer to fullDirs 
later.

It was just to ease lookups - instead of searching through a list, we look up a 
set. If you feel strongly about it, I can change it.
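
For illustration, a minimal sketch of the lookup pattern being discussed (the 
class name and paths are made up):
{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Copying the full-dirs list into a HashSet makes the repeated contains()
// checks in the loop O(1) instead of a linear scan of the list.
public class FullDirsLookupSketch {
  public static void main(String[] args) {
    List<String> fullDirs = Arrays.asList("/grid/1/yarn/local", "/grid/2/yarn/local");
    Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);
    System.out.println(postCheckFullDirs.contains("/grid/1/yarn/local")); // true
  }
}
{noformat}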

{quote}
can change
if (!postCheckFullDirs.contains(dir) && postCheckOtherDirs.contains(dir)) {
to
if (postCheckOtherDirs.contains(dir)) {
{quote}

Fixed.

{quote}
change
if (!postCheckOtherDirs.contains(dir) && postCheckFullDirs.contains(dir)) {
to
if (postCheckFullDirs.contains(dir)) {
{quote}

Fixed.

{quote}
3. in verifyDirUsingMkdir:
Can we add an int variable to the file name to avoid looping forever (although 
it is a very small chance), like the following?
long i = 0L;
while (target.exists()) {
  randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
  target = new File(dir, randomDirName);
}
{quote}

Fixed.
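
For context, here is a minimal, self-contained sketch of what the mkdir probe 
with the counter suffix looks like; the method name follows the discussion 
above, but the class and the remaining details are illustrative rather than the 
patch code:
{noformat}
import java.io.File;
import java.io.IOException;

import org.apache.commons.lang.RandomStringUtils;

public class MkdirProbeSketch {
  // Create and then remove a throwaway subdirectory to prove the volume is
  // writable. The counter suffix guarantees the name search terminates even
  // if the random names keep colliding with existing entries.
  static void verifyDirUsingMkdir(File dir) throws IOException {
    String randomDirName = RandomStringUtils.randomAlphanumeric(5);
    File target = new File(dir, randomDirName);
    long i = 0L;
    while (target.exists()) {
      randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
      target = new File(dir, randomDirName);
    }
    try {
      if (!target.mkdir()) {
        throw new IOException("Could not create directory " + target);
      }
    } finally {
      // Best-effort cleanup; delete() does not throw, so it cannot mask the
      // original exception.
      target.delete();
    }
  }

  public static void main(String[] args) throws IOException {
    verifyDirUsingMkdir(new File(System.getProperty("java.io.tmpdir")));
  }
}
{noformat}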

{quote}
4. in disksTurnedBad:
Can we add a break in the loop when disksFailed is true so we exit the loop 
earlier?
if (!preCheckDirs.contains(dir)) {
  disksFailed = true;
  break;
}
{quote}

Fixed.

{quote}
5. in disksTurnedGood, same as item 4:
Can we add a break in the loop when disksTurnedGood is true?
{quote}

Fixed.
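
A compact sketch of the two helpers with the early exits applied; the 
signatures are assumed from the comments above and are not copied from the 
patch:
{noformat}
import java.util.Arrays;
import java.util.List;

public class DiskChangeSketch {
  // True if any currently-failed dir was not already failed before the check.
  static boolean disksTurnedBad(List<String> preFailedDirs,
                                List<String> postCheckFailedDirs) {
    for (String dir : postCheckFailedDirs) {
      if (!preFailedDirs.contains(dir)) {
        return true;      // exit as soon as one new failure is seen
      }
    }
    return false;
  }

  // True if any previously-failed dir is no longer in the failed list.
  static boolean disksTurnedGood(List<String> preFailedDirs,
                                 List<String> postCheckFailedDirs) {
    for (String dir : preFailedDirs) {
      if (!postCheckFailedDirs.contains(dir)) {
        return true;      // same early exit for the recovery case
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> before = Arrays.asList("/grid/1");
    List<String> after = Arrays.asList("/grid/1", "/grid/2");
    System.out.println(disksTurnedBad(before, after));   // true
    System.out.println(disksTurnedGood(before, after));  // false
  }
}
{noformat}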

{quote}
In verifyDirUsingMkdir, the sequence of target.exists(), target.mkdir() and 
FileUtils.deleteQuietly(target) is not atomic.
What happens if another thread tries to create the same directory (target)?
{quote}

verifyDirUsingMkdir is called by testDirs, which is called by checkDirs(), 
which is synchronized, so two threads won't run the check concurrently.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.7.patch

Uploaded a new patch to address the comments by Jason.

{quote}

bq. I've changed it to "Disk(s) health report: ". My only concern with this 
is that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk 
list and the new one, logging which disks turned bad using the "Disk(s) failed" 
line and which disks became healthy with some other log message.
{quote}

Fixed. We now have two log messages - one indicating when disks go bad and one 
when disks get marked as good.
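
Roughly, the diff-based logging amounts to the following sketch (names and 
message wording are illustrative, and System.out stands in for the NM logger):
{noformat}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DiskHealthLoggingSketch {
  // Compare the failed-dir list before and after checkDirs() and log the two
  // kinds of transitions separately, so existing "Disk(s) failed" monitoring
  // keeps working while recoveries get their own message.
  static void logDiskChanges(List<String> preFailedDirs,
                             List<String> postFailedDirs) {
    List<String> newlyFailed = new ArrayList<String>(postFailedDirs);
    newlyFailed.removeAll(preFailedDirs);
    List<String> newlyGood = new ArrayList<String>(preFailedDirs);
    newlyGood.removeAll(postFailedDirs);
    if (!newlyFailed.isEmpty()) {
      System.out.println("Disk(s) failed: " + newlyFailed);
    }
    if (!newlyGood.isEmpty()) {
      System.out.println("Disk(s) turned good: " + newlyGood);
    }
  }

  public static void main(String[] args) {
    // One previously failed disk ("/grid/1") has recovered in this example.
    logDiskChanges(Arrays.asList("/grid/1", "/grid/2"), Arrays.asList("/grid/2"));
  }
}
{noformat}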

{quote}
bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts. I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there. This could be part of the health check scan or done separately. That 
way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup. Sorta like an fsck for the NM state on 
disk. That is best done as a separate JIRA, as I think this functionality is 
still an incremental improvement without it.
{quote}

The current code will only clean up if NM recovery can't be carried out:
{noformat}
if (!stateStore.canRecover()) {
  cleanUpLocalDirs(lfs, delService);
  initializeLocalDirs(lfs);
  initializeLogDirs(lfs);
}
{noformat}
Will that handle the case you mentioned?

bq. checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.

Fixed.

bq. isDiskFreeSpaceOverLimit is now named backwards, as the code returns true if 
the free space is under the limit.

Fixed.
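
For reference, a hedged sketch of what the corrected predicate amounts to; the 
final method and field names in the patch may differ:
{noformat}
import java.io.File;

public class FreeSpaceCheckSketch {
  // Minimum free space (in MB) a disk must keep to stay in the good list;
  // illustrative value, not the YARN configuration default.
  private final long minFreeSpaceMB;

  FreeSpaceCheckSketch(long minFreeSpaceMB) {
    this.minFreeSpaceMB = minFreeSpaceMB;
  }

  // Returns true when the disk should be treated as full, i.e. its usable
  // space has dropped below the configured minimum.
  boolean isDiskFreeSpaceUnderLimit(File dir) {
    long freeSpaceMB = dir.getUsableSpace() / (1024 * 1024);
    return freeSpaceMB < minFreeSpaceMB;
  }

  public static void main(String[] args) {
    FreeSpaceCheckSketch check = new FreeSpaceCheckSketch(100);
    System.out.println(check.isDiskFreeSpaceUnderLimit(new File("/tmp")));
  }
}
{noformat}
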
bq. getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc 
comments like the other methods.

Fixed.

{quote}
Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that. Also 
the function should leverage the fact that it knows how big the ArrayList will 
be after the operations and give it the appropriate hint to its constructor to 
avoid reallocations.
{quote}

Fixed.
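
Something along these lines, as a sketch of the suggestion (the helper name is 
assumed):
{noformat}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConcatSketch {
  // Concatenates two lists. Sizing the ArrayList up front avoids the interim
  // reallocations the reviewer mentions.
  static List<String> concat(List<String> first, List<String> second) {
    List<String> result = new ArrayList<String>(first.size() + second.size());
    result.addAll(first);
    result.addAll(second);
    return result;
  }

  public static void main(String[] args) {
    System.out.println(concat(Arrays.asList("/grid/1"), Arrays.asList("/grid/2")));
  }
}
{noformat}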



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-24 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.5.patch

Uploaded a new patch to address [~jlowe]'s comments.

{quote}
It's a bit odd to have a hash map mapping disk error types to lists of 
directories and to fill them all in, when in practice we only look at one type 
in the map, and that's DISK_FULL. It'd be simpler (and faster, and less space 
since there's no hashmap involved) to just track full disks as a separate 
collection like we already do for localDirs and failedDirs.
{quote}

Fixed. I renamed failedDirs to errorDirs and added a list for fullDirs. The 
getFailedDirs() function returns a union of the two.
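
In outline, the bookkeeping now looks something like the sketch below; the 
field names follow the comment above, everything else is illustrative:
{noformat}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DirectoryCollectionSketch {
  // Dirs currently considered good (not used further in this sketch).
  private final List<String> localDirs = new ArrayList<String>();
  // Dirs that failed the health check because of an error (permissions, I/O).
  private final List<String> errorDirs = new ArrayList<String>();
  // Dirs that are otherwise healthy but over the disk-utilization cutoff.
  private final List<String> fullDirs = new ArrayList<String>();

  // Failed dirs are simply the concatenation of the two failure categories,
  // so callers that only care about "not usable" keep working unchanged.
  synchronized List<String> getFailedDirs() {
    List<String> failed =
        new ArrayList<String>(errorDirs.size() + fullDirs.size());
    failed.addAll(errorDirs);
    failed.addAll(fullDirs);
    return Collections.unmodifiableList(failed);
  }
}
{noformat}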

{quote}
Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match 
the style of other enum types in the code.
{quote}

Fixed.

{quote}
In verifyDirUsingMkdir, if an error occurs during the finally clause then that 
exception will mask the original exception
{quote}

Fixed.

{quote}
isDiskUsageUnderPercentageLimit is named backwards. Disk usage being under the 
configured limit shouldn't be a full-disk error, and the error message is 
inconsistent with the method name (the method talks about being under, but the 
error message says it's above).
{noformat}
if (isDiskUsageUnderPercentageLimit(testDir)) {
  msg =
      "used space above threshold of "
          + diskUtilizationPercentageCutoff
          + "%, removing from the list of valid directories.";
{noformat}
{quote}

Yep, thanks for catching it. Fixed.
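
The corrected check works out to roughly the following sketch; the real method 
and field names in the patch may differ:
{noformat}
import java.io.File;

public class DiskUsageCheckSketch {
  // Percentage of used space above which a disk is considered full;
  // illustrative value rather than the YARN default.
  private final float diskUtilizationPercentageCutoff;

  DiskUsageCheckSketch(float cutoff) {
    this.diskUtilizationPercentageCutoff = cutoff;
  }

  // True when used space exceeds the cutoff, i.e. the disk should be moved to
  // the full-disks list (the opposite of the old, backwards-named check).
  boolean isDiskUsageOverPercentageLimit(File dir) {
    float freePct = 100.0F * dir.getUsableSpace() / dir.getTotalSpace();
    return (100.0F - freePct) > diskUtilizationPercentageCutoff;
  }

  public static void main(String[] args) {
    System.out.println(new DiskUsageCheckSketch(90.0F)
        .isDiskUsageOverPercentageLimit(new File("/tmp")));
  }
}
{noformat}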

{quote}
We should only call getDisksHealthReport() once in the following code:
{noformat}
+String report = getDisksHealthReport();
+if (!report.isEmpty()) {
+  LOG.info("Disk(s) failed. " + getDisksHealthReport());
{noformat}
{quote}
Fixed.

{quote}
Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't 
empty? Thinking of the case where two disks go bad, then one later is restored. 
The health report will still have something, but that last update is a disk 
turning good, not failing. Before, this code was only called when a new disk 
failed, and now that's not always the case. Maybe it should just be something 
like "Disk health update: " instead?
{quote}

I've changed it to "Disk(s) health report: ". My only concern with this is that 
there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

{quote}
Is it really necessary to stat a directory before we try to delete it? Seems 
like we can just try to delete it.
{quote}

Just wanted to avoid an unnecessary attempt. If a disk comes back as good 
while a container is running, it won't have the container directories, leading 
to an unnecessary delete.

{quote}
The idiom of getting the directories and adding the full directories seems 
pretty common. Might be good to have dirhandler methods that already do this, 
like getLocalDirsForCleanup or getLogDirsForCleanup.
{quote}

Fixed.

{quote}
I'm a bit worried that getInitializedLocalDirs could potentially try to delete 
an entire directory tree for a disk. If this fails in some sector-specific way 
but other containers are currently using their files from other sectors just 
fine on the same disk, removing these files from underneath active containers 
could be very problematic and difficult to debug.
{quote}

Fixed. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-24 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.6.patch

Uploaded patch with fixed warnings and test cases.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-19 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.3.patch

Rebased the patch to trunk and made small improvements to attempt cleanups on 
full directories.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-19 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--
Attachment: apache-yarn-90.4.patch

Patch with a findbugs fix.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-13 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.2.patch

Fixed the issue that caused the patch application to fail.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.1.patch

Uploaded a new patch.
{quote}
DirectoryCollection: can you put the block where you create and delete a 
random directory inside a dir.exists() check? We don't want to create and delete 
a directory that already exists but happens to match our random string - very 
unlikely but not impossible.
{quote}
Fixed. The dir check is now its own function with the exists check.

{quote}
ResourceLocalizationService (RLS): What happens to disks that become good 
after service-init? We don't create the top-level directories there. Depending 
on the assumptions in the code in the remaining NM subsystems, this may or may 
not lead to bad bugs. Should we permanently exclude bad disks found during 
initialization?
Similarly in RLS service-init, we call cleanUpLocalDir() to delete old files. If 
disks become good again, we will have unclean disks, and depending on our 
assumptions we may or may not run into issues. For example, files 'leaked' like 
that may never get deleted.
{quote}
Fixed. Local and log dirs undergo a check before use to ensure that they have 
been set up correctly.

{quote}
Add comments to all the tests describing what is being tested
{quote}
Fixed.

{quote}
Add more inline comments for each test block, e.g. changing a disk to be bad 
before a block where you change permissions. For readability.
{quote}
Fixed.

{quote}
In all the tests where you sleep for longer than the disk-checker 
frequency, the test may or may not pass depending on the underlying thread 
scheduling. Instead of that, you should explicitly call 
LocalDirsHandlerService.checkDirs().
{quote}
Fixed. Used mocks of the LocalDirsHandlerService, removing the timing issue.
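
As an illustration of that approach (not the actual test code), mocking the 
dirs handler lets a test flip a disk from good to bad deterministically instead 
of waiting on the checker thread:
{noformat}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;

public class MockedDirsHandlerSketch {
  public static void main(String[] args) {
    LocalDirsHandlerService dirsHandler = mock(LocalDirsHandlerService.class);
    List<String> goodDirs = Arrays.asList("/grid/0/yarn/local");
    // The first call sees a healthy disk, the second sees it gone bad; the
    // transition happens exactly when the test asks for it, with no sleeps.
    when(dirsHandler.getLocalDirs())
        .thenReturn(goodDirs)
        .thenReturn(Collections.<String>emptyList());
    System.out.println(dirsHandler.getLocalDirs()); // [/grid/0/yarn/local]
    System.out.println(dirsHandler.getLocalDirs()); // []
  }
}
{noformat}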

{quote}
TestResourceLocalizationService.testFailedDirsResourceRelease():
Nonstandard formatting in the method declaration.
There is a bit of code about creating container-dirs. Can we reuse some 
of it from ContainerLocalizer?
{quote}
Fixed the non-standard formatting. The ContainerLocalizer code creates only the 
usercache (we need the filecache and nmPrivate dirs as well).

{quote}
TestNonAggregatingLogHandler:
In the existing test case, you have to actually create the dirs. Why is 
that needed?
{quote}
Fixed. Used mocking to remove the requirement.

{quote}
Can we reuse any code between this test and what exists in 
TestLogAggregationService? It seems to me that they should both be mostly the same.
{quote}
Fixed. Shared code moved into functions.

{quote}
TestDirectoryCollection.testFailedDirPassingCheck -> 
testFailedDisksBecomingGoodAgain
{quote}
Fixed.





[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.1.patch



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-10 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-90:
--

Attachment: apache-yarn-90.0.patch

Added a patch which brings back failed disks once they pass checks. The patch 
also fixes cleanup of local and log dirs so that directories on disks which 
might have gone bad while the app was running still get cleaned up.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-15 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-90:
-

Attachment: YARN-90.patch

Thanks a lot, Nigel and Song. I'm making the changes that I requested, to push 
it over the line. The same patch applies cleanly to branch-2 as well. Could 
someone kindly review and commit it?





[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-05 Thread Hou Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hou Song updated YARN-90:
-

Attachment: YARN-90.patch

Now I understand, thanks.
Please review this patch first; I will open a new jira for the information 
exposure soon.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-03 Thread Hou Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hou Song updated YARN-90:
-

Attachment: YARN-90.patch

This is my patch.
Please kindly review it and give me feedback.
Thanks.



[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-09-25 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-90:
--

Attachment: YARN-90.1.patch
