[
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804106#comment-15804106
]
Ying Zhang commented on YARN-6031:
----------------------------------
I've uploaded a new patch YARN-6031.002.patch with the suggested approach.
Please kindly review and see if this is what we wanted. I'm working on adding
test case now. Thank you.
[~sunilg],
{quote}
I think if we can create RMAppImpl object and push transition into FAILED (with
clear diags), then user/admin can easily know what happened (can then take
action to remove from statestore or not). But not all apps may fail. For eg
An app which was running and there are no more outstanding requests (AM
Container will be going to default label). Old containers may belong some
specific label.
An app which was running and there are more outstanding requests to some labels
(AM Container will be going to default label)
{quote}
I'm not sure how to achieve this. With current patch, all applications with
specified node label expression will be rejected and fail during recovery
(including those which had already successfully finished before we disabling
Node label and restarting RM.) Please share your thoughts if you think this
should be improved.:-)
> Application recovery failed after disabling node label
> ------------------------------------------------------
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.8.0
> Reporter: Ying Zhang
> Assignee: Ying Zhang
> Priority: Minor
> Attachments: YARN-6031.001.patch, YARN-6031.002.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by:
> org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException:
> Invalid resource request, node label not enabled but request contains label
> expression
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had
> node label expression specified while node label has been disabled.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]