[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784818#comment-15784818
 ] 

Sunil G commented on YARN-6031:
-------------------------------

I gave a second thought. 

If we skip and continue, then we will be loosing the record of that app from RM 
end (It will be there in Store still). We may be pushing detail information to 
logs, but user/admin need not have to take a look in logs as recovery went 
smooth (discarding few apps). If those apps are considered as failed, then 
user/admin can take some actions as mentioned by Daniel.

I think if we can create RMAppImpl object and push transition into FAILED (with 
clear diags), then user/admin can easily know what happened (can then take 
action to remove from statestore or not). But not all apps may fail. For eg
- An app which was running and there are no more outstanding requests (AM 
Container will be going to default label). Old containers may belong some 
specific label. 
- An app which was running and there are more outstanding requests to some 
labels (AM Container will be going to default label) 

In first case, app will still continue running. In second case, its upto the 
app to decide when they get InvalidResourceRequestException in allocate call.

I feel we can consider accepting the app and then mark as failed with a 
validation once RMAppImpl transitions are started (need not to be done from 
scheduler, we can do in RECOVER event if possible). Thoughts?

> Application recovery failed after disabling node label
> ------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>         Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: 
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: 
> Invalid resource request, node label not enabled but request contains label 
> expression
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had 
> node label expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to