[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100766#comment-16100766
 ] 

Jian He commented on YARN-6031:
-------------------------------

bq. In RMAppManager#createAndPopulateNewRMApp, the app is just created, whether 
it's in submission or recovery mode. The attempt is not yet created. Hence I 
think this won't be a problem.
The scenario is this: the RMApp has now transitioned to FAILED and that state 
is persisted in the state store, but the attempt state is still null. If the 
admin later re-enables node labels, the RMApp will be recovered as FAILED, but 
the attempt state will be NULL.
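
To make the scenario concrete, here is a minimal, self-contained sketch. 
Store, State and failDuringRecovery are hypothetical stand-ins, not YARN 
classes, and attempts are keyed by app id only to keep the example small:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are real YARN classes. */
public class RecoveredStateSketch {
    enum State { RUNNING, FAILED }

    /** Stand-in for the RM state store (assumption: one attempt per app id). */
    static final class Store {
        final Map<String, State> appStates = new HashMap<>();
        final Map<String, State> attemptStates = new HashMap<>();
    }

    /** Fails the app during recovery, before any attempt has been created. */
    static void failDuringRecovery(Store store, String appId) {
        store.appStates.put(appId, State.FAILED);
        // No attempt exists yet, so nothing is written to attemptStates.
    }

    /** Reads back what a later recovery would see for this app. */
    static String describe(Store store, String appId) {
        return "app=" + store.appStates.get(appId)
            + ", attempt=" + store.attemptStates.get(appId);
    }

    public static void main(String[] args) {
        Store store = new Store();
        failDuringRecovery(store, "app_1");
        // A later restart reads the persisted state back.
        System.out.println(describe(store, "app_1"));
    }
}
```

In this sketch the app-level state is FAILED while the attempt lookup returns 
null, which is the mismatch described above.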

bq.  Hence recovery for other apps will also continue and we will have context 
of this app as well.
Killing an app because of an admin mistake may be harsh from the perspective 
of a service app, as all of its service containers will be killed. I was 
wondering whether we can let the app continue to run: existing containers keep 
running fine, and new requests with a label are rejected. Perhaps we can 
surface this as a diagnostic to the user?
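
A rough sketch of that alternative, under stated assumptions: App, Request and 
handleRequest below are hypothetical names for illustration, not the actual 
RMAppManager or SchedulerUtils code. The app stays running, a labeled request 
is rejected while node labels are disabled, and the reason is recorded as a 
diagnostic:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are real YARN classes. */
public class LenientLabelRecoverySketch {

    /** Stand-in for a resource request; labelExpression may be null or empty. */
    static final class Request {
        final String labelExpression;
        Request(String labelExpression) { this.labelExpression = labelExpression; }
        boolean hasLabel() {
            return labelExpression != null && !labelExpression.isEmpty();
        }
    }

    /** Stand-in for a recovered application that collects diagnostics. */
    static final class App {
        final String id;
        final List<String> diagnostics = new ArrayList<>();
        boolean running = true;
        App(String id) { this.id = id; }
    }

    /**
     * Returns true if the request is accepted. With node labels disabled, a
     * labeled request is rejected and a diagnostic is recorded, but the app
     * itself is left running.
     */
    static boolean handleRequest(App app, Request req, boolean nodeLabelsEnabled) {
        if (!nodeLabelsEnabled && req.hasLabel()) {
            app.diagnostics.add("Rejected request with label expression '"
                + req.labelExpression + "': node labels are disabled");
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        App app = new App("app_1");
        boolean labeled = handleRequest(app, new Request("gpu"), false);
        boolean unlabeled = handleRequest(app, new Request(null), false);
        System.out.println("labeled accepted: " + labeled);
        System.out.println("unlabeled accepted: " + unlabeled);
        System.out.println("still running: " + app.running);
        System.out.println("diagnostics: " + app.diagnostics);
    }
}
```

The design choice here is to fail the individual request rather than the 
application, so a misconfiguration does not take down existing containers.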



> Application recovery has failed when node label feature is turned off during 
> RM recovery
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>             Fix For: 2.9.0, 3.0.0-alpha4, 2.8.2
>
>         Attachments: YARN-6031.001.patch, YARN-6031.002.patch, 
> YARN-6031.003.patch, YARN-6031.004.patch, YARN-6031.005.patch, 
> YARN-6031.006.patch, YARN-6031.007.patch, YARN-6031-branch-2.8.001.patch
>
>
> Here are the repro steps:
> 1. Enable node labels, restart the RM, configure CS properly, and run some jobs.
> 2. Disable node labels and restart the RM; the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels were disabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

