[ 
https://issues.apache.org/jira/browse/YARN-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787491#comment-13787491
 ] 

Vinod Kumar Vavilapalli commented on YARN-1278:
-----------------------------------------------

Here's what happened:
{code}
2013-10-05 15:03:57,154 INFO  nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:startLocalizer(105)) - CWD set to 
/grid/0/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001 
= 
file:/grid/0/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001
2013-10-05 15:03:57,251 INFO  localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:update(910)) - DEBUG: FAILED { 
hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar, 
1380985387452, PATTERN, (?:classes/|lib/).* }, Rename cannot overwrite non 
empty destination directory 
/grid/4/hdp/yarn/local/usercache/hrt_qa/appcache/application_1380985373054_0001/filecache/10
2013-10-05 15:03:57,252 INFO  localizer.LocalizedResource 
(LocalizedResource.java:handle(196)) - Resource 
hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar 
transitioned from DOWNLOADING to FAILED
2013-10-05 15:03:57,253 INFO  container.Container 
(ContainerImpl.java:handle(871)) - Container 
container_1380985373054_0001_02_000001 transitioned from LOCALIZING to 
LOCALIZATION_FAILED
2013-10-05 15:03:57,253 INFO  localizer.LocalResourcesTrackerImpl 
(LocalResourcesTrackerImpl.java:handle(137)) - Container 
container_1380985373054_0001_02_000001 sent RELEASE event on a resource request 
{ hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar, 
1380985387452, PATTERN, (?:classes/|lib/).* } not present in cache.
2013-10-05 15:03:57,254 INFO  localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:processHeartbeat(553)) - Unknown localizer 
with localizerId container_1380985373054_0001_02_000001 is sending heartbeat. 
Ordering it to DIE
{code}

Basically, the RM restarted and all NMs were forced to resync. Because of 
YARN-1149, all applications are now removed from the NM on resync, but deletion 
of app resources is asynchronous. When the new AM starts, it tries to download 
the resources all over again, but we generate the local destination path based 
on sequence numbers tracked via LocalResourcesTracker.nextUniqueNumber. Because 
the original apps were removed, those sequence numbers are lost, so the same 
app relocalizes into a local path that still exists (filecache/10 in the log 
above) and the rename fails.
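
A minimal sketch of the collision described above, not the actual NodeManager code. The Tracker class, the in-memory set standing in for the local disk, and the starting counter value of 9 (picked only so the first path ends in filecache/10, as in the log) are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class ResyncCollisionSketch {

    /** Hypothetical stand-in for a per-application resource tracker. */
    static final class Tracker {
        // Assumed starting value, chosen so the first path ends in "10".
        private final AtomicLong nextUniqueNumber = new AtomicLong(9);

        String nextDestination(String appCacheDir) {
            return appCacheDir + "/filecache/" + nextUniqueNumber.incrementAndGet();
        }
    }

    /** Returns true if the path picked after resync is still present on "disk". */
    static boolean collidesAfterResync() {
        String appDir = "/grid/4/hdp/yarn/local/usercache/hrt_qa/appcache/"
                + "application_1380985373054_0001";
        // Stand-in for non-empty directories on the local disk.
        Set<String> dirsOnDisk = new HashSet<>();

        // First AM attempt: job.jar is localized into filecache/10.
        Tracker beforeResync = new Tracker();
        dirsOnDisk.add(beforeResync.nextDestination(appDir));

        // Resync: the application and its tracker are dropped, but the
        // asynchronous deletion has not yet removed the directory.
        Tracker afterResync = new Tracker();
        String retryPath = afterResync.nextDestination(appDir);

        // The fresh tracker restarts its numbering and picks the same path.
        return dirsOnDisk.contains(retryPath);
    }

    public static void main(String[] args) {
        System.out.println("collision after resync: " + collidesAfterResync());
    }
}
```

Running main prints "collision after resync: true": the second tracker restarts from the same number, so it targets the directory that the pending deletion has not cleaned up.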

I think we shouldn't destroy app resources on resync. That is desirable anyway, 
as there is no need to relocalize everything just because of an RM resync.
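
One possible reading of that suggestion, as a rough sketch rather than a proposed patch: if the NM keeps its record of localized resources across resync, a repeated request for the same resource is served from the existing local path and nothing is downloaded again. All class, method, and field names here are hypothetical, and the counter start value of 9 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class KeepResourcesOnResyncSketch {
    // Hypothetical record of already-localized resources: remote URI -> local path.
    private final Map<String, String> localized = new HashMap<>();
    // Assumed starting value for the sequence number.
    private final AtomicLong nextUniqueNumber = new AtomicLong(9);
    private int downloads = 0;

    /** Returns the local path, downloading only if the resource is not yet known. */
    String localize(String resource) {
        return localized.computeIfAbsent(resource, r -> {
            downloads++;
            return "filecache/" + nextUniqueNumber.incrementAndGet();
        });
    }

    /** Resync that deliberately leaves localized resources and the counter alone. */
    void onResync() {
        // Nothing is cleared here; that is the point of the sketch.
    }

    /** Counts downloads when the same resource is requested before and after resync. */
    static int downloadsAcrossResync() {
        KeepResourcesOnResyncSketch nm = new KeepResourcesOnResyncSketch();
        String jar = "hdfs://HDFS:8020/user/hrt_qa/.staging/job_1380985373054_0001/job.jar";
        nm.localize(jar);   // first AM attempt
        nm.onResync();      // RM restart forces a resync
        nm.localize(jar);   // new AM attempt asks for the same resource
        return nm.downloads;
    }

    public static void main(String[] args) {
        System.out.println("downloads: " + downloadsAcrossResync());
    }
}
```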

> New AM does not start after rm restart
> --------------------------------------
>
>                 Key: YARN-1278
>                 URL: https://issues.apache.org/jira/browse/YARN-1278
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.1-beta
>            Reporter: Yesha Vora
>            Priority: Blocker
>
> The new AM fails to start after the RM restarts. The new Application Master 
> is not started and the job fails with the error below.
>  /usr/bin/mapred job -status job_1380985373054_0001
> 13/10/05 15:04:04 INFO client.RMProxy: Connecting to ResourceManager at 
> hostname
> Job: job_1380985373054_0001
> Job File: /user/abc/.staging/job_1380985373054_0001/job.xml
> Job Tracking URL : 
> http://hostname:8088/cluster/app/application_1380985373054_0001
> Uber job : false
> Number of maps: 0
> Number of reduces: 0
> map() completion: 0.0
> reduce() completion: 0.0
> Job state: FAILED
> retired: false
> reason for failure: There are no failed tasks for the job. Job is failed due 
> to some other reason and reason can be found in the logs.
> Counters: 0



--
This message was sent by Atlassian JIRA
(v6.1#6144)
