Jian He commented on YARN-6959:

It's still unclear to me. For MARK2, once the lock is released, it can just 
  synchronized (lock) { // MARK2: The RPC call may be blocked here for a long 
    // MARK3: During MARK1 and here, RM may switch to the new attempt. So, 
    // attempt ResourceRequest may be recorded into current attempt 
    scheduler.allocate(attemptId, ask, ...) -> 
>From the log, I do see that the AM container size changed. Also, I see that 
>the first AM container completed at 
2017-07-31 21:29:38,338 INFO [ResourceManager Event Processor] 
container_e71_1500967702061_2512_01_000001 Container Transitioned from RUNNING 
if the AM container process had already exited, how is it possible to call 
allocate again. 
Can you check on NodeManager Log that if the first AM container indeed 
Are you able to enable debug level log and reproduce this issue ? or reproduce 
the issue with a UT.

> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, fairscheduler, scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>         Attachments: YARN-6959.001.patch, YARN-6959.002.patch, 
> YARN-6959.003.patch, YARN-6959.004.patch, YARN-6959.005.patch, 
> YARN-6959-branch-2.7.001.patch, YARN-6959.yarn_rm.log.zip
> *Issue Summary:*
> Previous attempt ResourceRequest may be recorded into current attempt 
> ResourceRequests. These mis-recorded ResourceRequests may confuse AM 
> Container Request and Allocation for current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // Previous precondition check for the attempt id may be outdated here, 
> // i.e. the currentAttempt may not be the corresponding attempt of the 
> attemptId.
> // Such as the attempt id is corresponding to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // Previous attempt ResourceRequest may be recorded into current attempt 
> ResourceRequests
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate wrong AM Container for the current attempt, because its 
> ResourceRequests
> // may come from previous attempt which can be any ResourceRequests previous 
> AM asked
> // and there is not matching logic for the original AM Container 
> ResourceRequest and 
> // the returned amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
> *Patch Correctness:*
> Because after this Patch, RM will definitely record ResourceRequests from 
> different attempt into different objects of 
> SchedulerApplicationAttempt.AppSchedulingInfo.
> So, even if RM still record ResourceRequests from old attempt at any time, 
> these ResourceRequests will be recorded in old AppSchedulingInfo object which 
> will not impact current attempt's resource requests and allocation.
> *Concerns:*
> The getApplicationAttempt function in AbstractYarnScheduler is so confusing, 
> we should better rename it to getCurrentApplicationAttempt. And reconsider 
> whether there are any other bugs related to getApplicationAttempt.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to