[jira] [Commented] (YARN-3832) Resource Localization fails on a cluster due to existing cache directories

Jason Lowe (JIRA) Tue, 23 Jun 2015 09:12:15 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597873#comment-14597873
 ]


Jason Lowe commented on YARN-3832:
----------------------------------

Ah, I think that might be the clue as to what went wrong.  If the NM recreated 
the state store on startup, then ResourceLocalizationService will try to clean 
up the localized resources to prevent them from getting out of sync with the 
state store.  Unfortunately, the code does this:
{code}
  private void cleanUpLocalDirs(FileContext lfs, DeletionService del) {
    for (String localDir : dirsHandler.getLocalDirs()) {
      cleanUpLocalDir(lfs, del, localDir);
    }
  }
{code}

It should be calling dirsHandler.getLocalDirsForCleanup, since getLocalDirs 
does not include any disks that are full.  Since the disk was too full, it 
probably wasn't in the list of local dirs, and therefore we skipped cleaning up 
the localized resources on that disk.  Later, when the disk became good again, 
the NM tried to use it, but at that point the state store and the localized 
resources on that disk were out of sync, so new localizations can collide with 
old ones.
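
To make the difference concrete, here is a minimal, self-contained sketch of 
the idea.  It is not the actual patch: DirsHandler below is a hypothetical 
stand-in for the NM's dirs handler, the directory paths are made up, and I am 
assuming getLocalDirsForCleanup returns the good dirs plus the full ones.  The 
real method would call cleanUpLocalDir(lfs, del, localDir) where the sketch 
just records the directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CleanupDirsSketch {

  /** Hypothetical stand-in for the NM's dirs handler; not the real Hadoop class. */
  static class DirsHandler {
    private final List<String> goodDirs;
    private final List<String> fullDirs;

    DirsHandler(List<String> goodDirs, List<String> fullDirs) {
      this.goodDirs = new ArrayList<>(goodDirs);
      this.fullDirs = new ArrayList<>(fullDirs);
    }

    /** Dirs usable for new localizations; full disks are left out. */
    List<String> getLocalDirs() {
      return new ArrayList<>(goodDirs);
    }

    /** Assumed behavior: good dirs plus full dirs, so stale data on full disks is covered. */
    List<String> getLocalDirsForCleanup() {
      List<String> all = new ArrayList<>(goodDirs);
      all.addAll(fullDirs);
      return all;
    }
  }

  /** Returns the dirs that would be cleaned when iterating the cleanup list. */
  static List<String> cleanUpLocalDirs(DirsHandler dirsHandler) {
    List<String> cleaned = new ArrayList<>();
    for (String localDir : dirsHandler.getLocalDirsForCleanup()) {
      // The real code would call cleanUpLocalDir(lfs, del, localDir) here.
      cleaned.add(localDir);
    }
    return cleaned;
  }

  public static void main(String[] args) {
    // Made-up paths: one healthy disk and one disk that is currently full.
    DirsHandler handler = new DirsHandler(
        Arrays.asList("/data1/nmlocal"),
        Arrays.asList("/data2/nmlocal"));
    System.out.println("getLocalDirs:  " + handler.getLocalDirs());
    System.out.println("cleaned dirs:  " + cleanUpLocalDirs(handler));
  }
}
```

With one good disk and one full disk, getLocalDirs reports only the good disk, 
while the cleanup loop also visits the full one, so nothing stale is left 
behind for a later localization to collide with.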

> Resource Localization fails on a cluster due to existing cache directories
> --------------------------------------------------------------------------
>
>                 Key: YARN-3832
>                 URL: https://issues.apache.org/jira/browse/YARN-3832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Ranga Swamy
>            Assignee: Brahma Reddy Battula
>
>  *We have found that resource localization fails on a cluster with the 
> following error.* 
>  
> We got this error in the hadoop-2.7.0 release, although it was fixed in 
> 2.6.0 (YARN-2624).
> {noformat}
> Application application_1434703279149_0057 failed 2 times due to AM Container 
> for appattempt_1434703279149_0057_000002 exited with exitCode: -1000
> For more detailed output, check application tracking 
> page:http://S0559LDPag68:45020/cluster/app/application_1434703279149_0057Then,
>  click on links to logs of each attempt.
> Diagnostics: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> java.io.IOException: Rename cannot overwrite non empty destination directory 
> /opt/hdfsdata/HA/nmlocal/usercache/root/filecache/39
> at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:735)
> at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:244)
> at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
> at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Failing this attempt. Failing the application.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
