[
https://issues.apache.org/jira/browse/YARN-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787491#comment-13787491
]
Vinod Kumar Vavilapalli commented on YARN-1278:
-----------------------------------------------
Here's what happened
{code}
2013-10-05 15:03:57,154 INFO nodemanager.DefaultContainerExecutor
(DefaultContainerExecutor.java:startLocalizer(105)) - CWD set to
/grid/0/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001
=
file:/grid/0/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001
2013-10-05 15:03:57,251 INFO localizer.ResourceLocalizationService
(ResourceLocalizationService.java:update(910)) - DEBUG: FAILED {
hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar,
1380985387452, PATTERN, (?:classes/|lib/).* }, Rename cannot overwrite non
empty destination directory
/grid/4/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001/filecache/10
2013-10-05 15:03:57,252 INFO localizer.LocalizedResource
(LocalizedResource.java:handle(196)) - Resource
hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar
transitioned from DOWNLOADING to FAILED
2013-10-05 15:03:57,253 INFO container.Container
(ContainerImpl.java:handle(871)) - Container
container_1380985373054_0001_02_000001 transitioned from LOCALIZING to
LOCALIZATION_FAILED
2013-10-05 15:03:57,253 INFO localizer.LocalResourcesTrackerImpl
(LocalResourcesTrackerImpl.java:handle(137)) - Container
container_1380985373054_0001_02_000001 sent RELEASE event on a resource request
{ hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar,
1380985387452, PATTERN, (?:classes/|lib/).* } not present in cache.
2013-10-05 15:03:57,254 INFO localizer.ResourceLocalizationService
(ResourceLocalizationService.java:processHeartbeat(553)) - Unknown localizer
with localizerId container_1380985373054_0001_02_000001 is sending heartbeat.
Ordering it to DIE
{code}
Basically, the RM restarted and all NMs were forced to resync. Because of
YARN-1149, all applications are now removed from the NM on resync, but deletion
of the app's local resources is asynchronous. When the new AM attempt starts,
it tries to download the resources all over again, and we generate the local
destination path from sequence numbers tracked via
LocalResourcesTracker.nextUniqueNumber. Because the original apps were removed,
those sequence numbers are lost, so the same app relocalizes from scratch and
its new downloads collide with the not-yet-deleted local paths.
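A minimal, self-contained sketch of the collision (not the actual NM code; the
class, method names, and paths here are illustrative, only
LocalResourcesTracker.nextUniqueNumber is from the real code): a fresh tracker
restarts the per-app counter at the same value while the old app-cache
directory is still on disk, so the rename into filecache/<seq> fails just like
"Rename cannot overwrite non empty destination directory" in the log above.
{code}
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicLong;

public class LocalCacheCollisionDemo {

    // Stand-in for LocalResourcesTracker.nextUniqueNumber: per-app counter
    // used to pick the local destination directory for a resource.
    private final AtomicLong nextUniqueNumber = new AtomicLong(10);
    private final Path appCacheRoot;

    LocalCacheCollisionDemo(Path appCacheRoot) {
        this.appCacheRoot = appCacheRoot;
    }

    // Download into a temp dir, then rename it to filecache/<uniqueNumber>,
    // mirroring the localize-then-rename pattern.
    void localize(String resourceName) throws IOException {
        long seq = nextUniqueNumber.getAndIncrement();
        Path dest = appCacheRoot.resolve("filecache").resolve(Long.toString(seq));
        Path tmp = Files.createTempDirectory(appCacheRoot, "tmp_");
        Files.createFile(tmp.resolve(resourceName));   // pretend download
        Files.createDirectories(dest.getParent());
        Files.move(tmp, dest);                         // fails if dest already exists
        System.out.println("Localized " + resourceName + " -> " + dest);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("appcache_app_0001_");

        // First AM attempt: sequence number 10 is used and populated on disk.
        new LocalCacheCollisionDemo(root).localize("job.jar");

        // RM restart + resync: the app and its tracker are dropped on the NM,
        // but asynchronous deletion of the app cache has not run yet. The new
        // AM attempt gets a fresh tracker, so the counter starts at 10 again
        // while .../filecache/10 is still present.
        try {
            new LocalCacheCollisionDemo(root).localize("job.jar");
        } catch (FileAlreadyExistsException | DirectoryNotEmptyException e) {
            System.out.println("Relocalization failed, as in the log: " + e);
        }
    }
}
{code}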
I think we shouldn't destroy app resources on resync. That is desirable anyway,
since there is no need to relocalize everything just because of an RM resync.
> New AM does not start after rm restart
> --------------------------------------
>
> Key: YARN-1278
> URL: https://issues.apache.org/jira/browse/YARN-1278
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.1.1-beta
> Reporter: Yesha Vora
> Priority: Blocker
>
> The new AM fails to start after the RM restarts. It fails to start the new
> ApplicationMaster and the job fails with the error below.
> /usr/bin/mapred job -status job_1380985373054_0001
> 13/10/05 15:04:04 INFO client.RMProxy: Connecting to ResourceManager at
> hostname
> Job: job_1380985373054_0001
> Job File: /user/abc/.staging/job_1380985373054_0001/job.xml
> Job Tracking URL :
> http://hostname:8088/cluster/app/application_1380985373054_0001
> Uber job : false
> Number of maps: 0
> Number of reduces: 0
> map() completion: 0.0
> reduce() completion: 0.0
> Job state: FAILED
> retired: false
> reason for failure: There are no failed tasks for the job. Job is failed due
> to some other reason and reason can be found in the logs.
> Counters: 0
--