Yuqi Wang created YARN-6959:
-------------------------------
Summary: RM may allocate wrong AM Container for new attempt
Key: YARN-6959
URL: https://issues.apache.org/jira/browse/YARN-6959
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
Fix For: 3.0.0-alpha4, 2.7.1
*Issue Summary:*
A ResourceRequest from the previous attempt may be recorded into the current
attempt's ResourceRequests. These mis-recorded ResourceRequests can corrupt the
AM container request and allocation for the current attempt.
*Issue Pipeline:*
{code:java}
// The precondition check for the incoming attempt id is executed here.
ApplicationMasterService.allocate() ->
scheduler.allocate(attemptId, ask, ...) ->
// The earlier precondition check may be outdated by this point, i.e.
// currentAttempt may not be the attempt that attemptId refers to.
// For example, attemptId may refer to the previous attempt.
currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
// The previous attempt's ResourceRequests may be recorded into the current
// attempt's ResourceRequests.
currentAttempt.updateResourceRequests(ask) ->
// The RM may allocate the wrong AM container for the current attempt, because
// its ResourceRequests may come from the previous attempt (they can be any
// ResourceRequests the previous AM asked for), and there is no logic that
// matches the original AM container ResourceRequest against the
// amContainerAllocation returned below.
AMContainerAllocatedTransition.transition(...) ->
amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
{code}
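The pipeline above can be reproduced in miniature. The following is only a sketch under simplifying assumptions: StaleAskDemo, Attempt, Scheduler and the string "asks" are hypothetical stand-ins, not the real RM classes, and the lookup ignores the attempt number as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the RM scheduler classes (not the YARN API).
class StaleAskDemo {
    /** One application attempt and the asks recorded against it. */
    static final class Attempt {
        final int attemptNumber;
        final List<String> asks = new ArrayList<>();

        Attempt(int attemptNumber) {
            this.attemptNumber = attemptNumber;
        }
    }

    /** Scheduler that only tracks the current attempt of each application. */
    static final class Scheduler {
        private final Map<String, Attempt> currentAttemptByApp = new HashMap<>();

        void startAttempt(String appId, int attemptNumber) {
            currentAttemptByApp.put(appId, new Attempt(attemptNumber));
        }

        // Assumption taken from the report: the attempt number is ignored and
        // the current attempt of the application is returned.
        Attempt getApplicationAttempt(String appId, int attemptNumber) {
            return currentAttemptByApp.get(appId);
        }

        void allocate(String appId, int attemptNumber, String ask) {
            getApplicationAttempt(appId, attemptNumber).asks.add(ask);
        }
    }

    /** Returns the asks that end up recorded against attempt 2. */
    static List<String> run() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        // Attempt 1 fails and attempt 2 becomes the current attempt.
        scheduler.startAttempt("app_1", 2);
        scheduler.allocate("app_1", 2, "AM container for attempt 2");
        // A late allocate call from the AM of attempt 1 arrives afterwards.
        scheduler.allocate("app_1", 1, "worker container asked by attempt 1");
        return scheduler.getApplicationAttempt("app_1", 2).asks;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the ask sent on behalf of attempt 1 ends up in the list that belongs to attempt 2, which is the mis-recording described in the issue summary.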
*Patch Correctness:*
After this patch, the RM always records ResourceRequests from different
attempts into different SchedulerApplicationAttempt.AppSchedulingInfo objects.
So even if the RM still records ResourceRequests from an old attempt at some
point, they are recorded in the old AppSchedulingInfo object and do not affect
the current attempt's resource requests and allocation.
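The isolation property can be illustrated with a small sketch. It is not the patch itself: PerAttemptAskDemo, SchedulingInfo, the string key and the string "asks" are hypothetical names chosen for illustration, and the only point shown is that each attempt owns a separate object for its asks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one scheduling-info object per attempt (not the actual patch).
class PerAttemptAskDemo {
    /** Stand-in for the per-attempt AppSchedulingInfo object. */
    static final class SchedulingInfo {
        final List<String> asks = new ArrayList<>();
    }

    static final class Scheduler {
        // Keyed by application id AND attempt number, so attempts never share state.
        private final Map<String, SchedulingInfo> infoByAttempt = new HashMap<>();

        private static String key(String appId, int attemptNumber) {
            return appId + "#" + attemptNumber;
        }

        void startAttempt(String appId, int attemptNumber) {
            infoByAttempt.put(key(appId, attemptNumber), new SchedulingInfo());
        }

        void allocate(String appId, int attemptNumber, String ask) {
            SchedulingInfo info = infoByAttempt.get(key(appId, attemptNumber));
            if (info != null) {
                info.asks.add(ask);
            }
        }

        List<String> asksOf(String appId, int attemptNumber) {
            return infoByAttempt.get(key(appId, attemptNumber)).asks;
        }
    }

    /** Replays the late allocate call from attempt 1 after attempt 2 has started. */
    static Scheduler run() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        scheduler.startAttempt("app_1", 2);
        scheduler.allocate("app_1", 2, "AM container for attempt 2");
        scheduler.allocate("app_1", 1, "worker container asked by attempt 1");
        return scheduler;
    }

    public static void main(String[] args) {
        Scheduler scheduler = run();
        System.out.println("attempt 1: " + scheduler.asksOf("app_1", 1));
        System.out.println("attempt 2: " + scheduler.asksOf("app_1", 2));
    }
}
```

Here the late ask from attempt 1 is still recorded, but only in the object that belongs to attempt 1, so the asks of attempt 2 are unaffected.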
*Concerns:*
The name of the getApplicationAttempt function in AbstractYarnScheduler is
confusing; it would be better to rename it to getCurrentApplicationAttempt. We
should also check whether there are any other bugs related to
getApplicationAttempt.
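To make the naming concern concrete, here is a minimal sketch of the two lookup semantics side by side. All names are hypothetical simplifications, and findAttempt in particular is an illustrative helper that is not part of the proposal or of AbstractYarnScheduler.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch contrasting "current attempt" lookup with a strict lookup.
class AttemptLookupDemo {
    static final class Attempt {
        final int attemptNumber;

        Attempt(int attemptNumber) {
            this.attemptNumber = attemptNumber;
        }
    }

    static final class Scheduler {
        private final Map<String, Attempt> currentAttemptByApp = new HashMap<>();

        void startAttempt(String appId, int attemptNumber) {
            currentAttemptByApp.put(appId, new Attempt(attemptNumber));
        }

        /** The name states what is returned: the application's current attempt. */
        Attempt getCurrentApplicationAttempt(String appId) {
            return currentAttemptByApp.get(appId);
        }

        /** Illustrative strict lookup: empty unless attemptNumber is the current one. */
        Optional<Attempt> findAttempt(String appId, int attemptNumber) {
            Attempt current = currentAttemptByApp.get(appId);
            if (current == null || current.attemptNumber != attemptNumber) {
                return Optional.empty();
            }
            return Optional.of(current);
        }
    }

    /** Builds a scheduler whose application has moved from attempt 1 to attempt 2. */
    static Scheduler schedulerOnSecondAttempt() {
        Scheduler scheduler = new Scheduler();
        scheduler.startAttempt("app_1", 1);
        scheduler.startAttempt("app_1", 2);
        return scheduler;
    }

    public static void main(String[] args) {
        Scheduler scheduler = schedulerOnSecondAttempt();
        System.out.println(scheduler.getCurrentApplicationAttempt("app_1").attemptNumber);
        System.out.println(scheduler.findAttempt("app_1", 1).isPresent());
    }
}
```

With the explicit name, a caller holding an old attempt id can see that it is getting the current attempt rather than the one it asked for, and can decide whether that is acceptable.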