[
https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuqi Wang updated YARN-6959:
----------------------------
Attachment: (was: YARN-6959-branch-2.7.004.patch)
> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
> Key: YARN-6959
> URL: https://issues.apache.org/jira/browse/YARN-6959
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, fairscheduler, scheduler
> Affects Versions: 2.7.1
> Reporter: Yuqi Wang
> Assignee: Yuqi Wang
> Labels: patch
> Fix For: 2.8.0, 2.7.1, 3.0.0-alpha4
>
> Attachments: YARN-6959.005.patch, YARN-6959-branch-2.7.005.patch,
> YARN-6959-branch-2.8.001.patch, YARN-6959.yarn_nm.log.zip,
> YARN-6959.yarn_rm.log.zip
>
>
> *Issue Summary:*
> ResourceRequests from a previous attempt may be recorded into the current
> attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse
> the AM Container request and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // The precondition check is executed for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // The earlier precondition check for the attempt id may be outdated here,
> // i.e. currentAttempt may not be the attempt that corresponds to attemptId;
> // for example, attemptId may refer to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // The previous attempt's ResourceRequests may be recorded into the current
> // attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate the wrong AM Container for the current attempt, because
> // its ResourceRequests may come from the previous attempt (they can be any
> // ResourceRequests the previous AM asked for), and there is no logic that
> // matches the original AM Container ResourceRequest against the
> // amContainerAllocation returned below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
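The pipeline quoted above can be replayed with a small model. This is only a sketch under stated assumptions: StaleAttemptDemo, Attempt, Scheduler, startAttempt and replayRace are hypothetical names, not the real YARN classes, and the lookup is assumed to ignore the attempt number and return the application's current attempt, as the description suggests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the race; these are NOT the real YARN classes.
class StaleAttemptDemo {
    // Hypothetical stand-in for an application attempt and its recorded asks.
    static final class Attempt {
        final int attemptNumber;
        final List<String> resourceRequests = new ArrayList<>();

        Attempt(int attemptNumber) {
            this.attemptNumber = attemptNumber;
        }
    }

    // Hypothetical scheduler that tracks only the current attempt per application.
    static final class Scheduler {
        private final Map<String, Attempt> currentAttemptByApp = new HashMap<>();

        void startAttempt(String appId, int attemptNumber) {
            currentAttemptByApp.put(appId, new Attempt(attemptNumber));
        }

        // Assumed behavior: the attempt number is ignored, so the caller
        // always gets the application's current attempt.
        Attempt getApplicationAttempt(String appId, int attemptNumber) {
            return currentAttemptByApp.get(appId);
        }

        void allocate(String appId, int attemptNumber, String ask) {
            Attempt attempt = getApplicationAttempt(appId, attemptNumber);
            attempt.resourceRequests.add(ask);
        }
    }

    // Returns the asks recorded on attempt 2 after a late allocate() from attempt 1.
    static List<String> replayRace() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        // Attempt 1 fails and attempt 2 starts with its own AM Container ask.
        scheduler.startAttempt("app_1", 2);
        scheduler.allocate("app_1", 2, "AM container ask (attempt 2)");
        // A late call from the old AM arrives carrying the attempt 1 id.
        scheduler.allocate("app_1", 1, "task container ask (attempt 1)");
        return scheduler.getApplicationAttempt("app_1", 2).resourceRequests;
    }

    public static void main(String[] args) {
        System.out.println(replayRace());
    }
}
```

In this model the ask sent under the attempt 1 id ends up in the list owned by attempt 2, which is the mis-recording the issue describes.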
> *Patch Correctness:*
> With this patch, RM always records ResourceRequests from different attempts
> into different SchedulerApplicationAttempt.AppSchedulingInfo objects.
> So even if RM still records ResourceRequests from an old attempt at some
> point, they are recorded in the old AppSchedulingInfo object and do not
> affect the current attempt's resource requests and allocation.
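The isolation argument above can be illustrated with a minimal sketch. It is not the patch itself: PerAttemptInfoDemo, SchedulingInfo, Scheduler, requestsOf and replayRace are hypothetical names, and keying the bookkeeping by application id plus attempt number is an assumption made only to show one object per attempt.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-attempt bookkeeping; these are NOT the real YARN classes.
class PerAttemptInfoDemo {
    // Hypothetical stand-in for AppSchedulingInfo: one instance per attempt.
    static final class SchedulingInfo {
        final List<String> resourceRequests = new ArrayList<>();
    }

    static final class Scheduler {
        // Keyed by "appId#attemptNumber" so each attempt owns its own object.
        private final Map<String, SchedulingInfo> infoByAttempt = new HashMap<>();

        private static String key(String appId, int attemptNumber) {
            return appId + "#" + attemptNumber;
        }

        void startAttempt(String appId, int attemptNumber) {
            infoByAttempt.put(key(appId, attemptNumber), new SchedulingInfo());
        }

        // Asks are recorded on the object that belongs to the given attempt id.
        void allocate(String appId, int attemptNumber, String ask) {
            SchedulingInfo info = infoByAttempt.get(key(appId, attemptNumber));
            if (info != null) {
                info.resourceRequests.add(ask);
            }
        }

        List<String> requestsOf(String appId, int attemptNumber) {
            SchedulingInfo info = infoByAttempt.get(key(appId, attemptNumber));
            return info == null ? new ArrayList<>() : info.resourceRequests;
        }
    }

    // Replays a late allocate() from attempt 1 after attempt 2 has started.
    static Scheduler replayRace() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        scheduler.startAttempt("app_1", 2);
        scheduler.allocate("app_1", 2, "AM container ask (attempt 2)");
        scheduler.allocate("app_1", 1, "task container ask (attempt 1)");
        return scheduler;
    }

    public static void main(String[] args) {
        Scheduler scheduler = replayRace();
        System.out.println("attempt 1: " + scheduler.requestsOf("app_1", 1));
        System.out.println("attempt 2: " + scheduler.requestsOf("app_1", 2));
    }
}
```

Here the late ask is still recorded, but only on the old attempt's object, so the current attempt's requests contain just its own AM Container ask.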
> *Concerns:*
> The name getApplicationAttempt in AbstractYarnScheduler is confusing; it
> would be better to rename it to getCurrentApplicationAttempt. We should also
> reconsider whether there are any other bugs related to getApplicationAttempt.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)