[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119334#comment-16119334 ]

Yuqi Wang commented on YARN-6959:
---------------------------------

[~jianhe]
The race condition can be reproduced during the following segment of the pipeline of a single AM-RM RPC call:
{code:java}
// One AM-RM RPC call
ApplicationMasterService.allocate() {
  AllocateResponseLock lock = responseMap.get(appAttemptId);
  if (lock == null) { // MARK1: At this time, appAttemptId is still the current attempt, so the RPC call continues.
    ...
    throw new ApplicationAttemptNotFoundException();
  }
  synchronized (lock) { // MARK2: The RPC call may be blocked here for a long time.
    ...
    // MARK3: Between MARK1 and here, the RM may switch to the new attempt. So, the
    // previous attempt's ResourceRequest may be recorded into the current attempt's
    // ResourceRequests.
    scheduler.allocate(attemptId, ask, ...) ->
        scheduler.getApplicationAttempt(attemptId)
    ...
  }
}
{code}
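
To close the window between MARK1 and MARK3, one option is to re-validate the attempt id inside the synchronized block before touching the scheduler. Below is only a minimal sketch under that idea, reusing the shapes from the pseudocode above (responseMap, lock, scheduler); the rmContext lookup and the exact message are hypothetical, and this is not the actual patch:
{code:java}
// Minimal sketch (hypothetical): re-check that appAttemptId is still the
// current attempt after acquiring the lock, so a stale RPC cannot record its
// ask into the new attempt's state.
synchronized (lock) {
  RMApp app = rmContext.getRMApps().get(appAttemptId.getApplicationId());
  if (app == null
      || !appAttemptId.equals(app.getCurrentAppAttempt().getAppAttemptId())) {
    // The RM has switched to a new attempt since MARK1; reject the stale call.
    throw new ApplicationAttemptNotFoundException(
        "Attempt " + appAttemptId + " is no longer the current attempt");
  }
  scheduler.allocate(appAttemptId, ask, ...);
}
{code}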


I saw the log you mentioned. It shows that the RM switched to the new attempt, and afterwards some allocate() calls from the previous attempt still came into the scheduler.
For details, please check the full log I attached.
{code:java}
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e71_1500967702061_2512_01_000361 Container Transitioned from RUNNING to COMPLETED
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_e71_1500967702061_2512_01_000361 in state: COMPLETED event:FINISHED
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1500967702061_2512    CONTAINERID=container_e71_1500967702061_2512_01_000361
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: prod-new used=<memory:0, vCores:0, ports:null> numContainers=9349 user=hadoop user-resources=<memory:0, vCores:0, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_e71_1500967702061_2512_01_000361, NodeId: BN1APS0A410B91:10025, NodeHttpAddress: Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN1APS0A410B91/8042, Resource: <memory:5120, vCores:1, ports:null>, Priority: 1, Token: Token { kind: ContainerToken, service: 10.65.11.145:10025 }, ] queue=prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=<memory:0, vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, numContainers=9349 cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0, ports:null> cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.prod-new stats: prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=<memory:0, vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, numContainers=9349
2017-07-31 21:29:38,351 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1500967702061_2512_000001 released container container_e71_1500967702061_2512_01_000361 on node: host: BN1APS0A410B91:10025 #containers=3 available=<memory:30977, vCores:23, ports:null> used=<memory:23552, vCores:3, ports:null> with event: FINISHED
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1500967702061_2512_000001
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1500967702061_2512_000001
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1500967702061_2512_000001 State change from FINAL_SAVING to FAILED
2017-07-31 21:29:38,353 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 1. The max attempts is 3
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1500967702061_2512 State change from RUNNING to ACCEPTED
2017-07-31 21:29:38,354 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1500967702061_2512_000001 is done. finalState=FAILED
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1500967702061_2512_000002
2017-07-31 21:29:38,354 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1500967702061_2512_000002 State change from NEW to SUBMITTED
2017-07-31 21:29:38,354 INFO [ApplicationMasterLauncher #49] org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1500967702061_2512_000001
{code}
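
The reason the stale allocate() calls land in the new attempt is the attempt lookup itself: getApplicationAttempt resolves by application id only and always returns the current attempt. The following is paraphrased from memory of AbstractYarnScheduler in 2.7, so treat it as a sketch rather than the exact code:
{code:java}
// Sketch of AbstractYarnScheduler.getApplicationAttempt (paraphrased): the
// attempt number inside applicationAttemptId is effectively ignored, so a
// stale attempt id still resolves to the *current* attempt.
public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
  SchedulerApplication<T> app =
      applications.get(applicationAttemptId.getApplicationId());
  return app == null ? null : app.getCurrentAppAttempt();
}
{code}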



> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, fairscheduler, scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>
>         Attachments: YARN-6959.001.patch, YARN-6959.002.patch, 
> YARN-6959.003.patch, YARN-6959.004.patch, YARN-6959.005.patch, 
> YARN-6959-branch-2.7.001.patch
>
>
> *Issue Summary:*
> A previous attempt's ResourceRequest may be recorded into the current 
> attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse 
> the AM Container request and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // Executing precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
> // The previous precondition check for the attempt id may be outdated here,
> // i.e. currentAttempt may not be the attempt corresponding to attemptId.
> // For example, the attempt id may correspond to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
> // The previous attempt's ResourceRequest may be recorded into the current
> // attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
> // RM may allocate a wrong AM Container for the current attempt, because its
> // ResourceRequests may come from the previous attempt, which can be any
> // ResourceRequests the previous AM asked for, and there is no matching logic
> // between the original AM Container ResourceRequest and the returned
> // amContainerAllocation below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
> *Patch Correctness:*
> After this patch, RM will always record ResourceRequests from different 
> attempts into different SchedulerApplicationAttempt.AppSchedulingInfo 
> objects.
> So, even if RM still records ResourceRequests from the old attempt at any 
> time, these ResourceRequests will be recorded in the old AppSchedulingInfo 
> object, which will not impact the current attempt's resource requests and 
> allocation.
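> For illustration only, the isolation can be pictured as in the following 
> sketch; the class names come from the existing code, but the snippet itself 
> is hypothetical (including the getAppSchedulingInfo accessor), not the patch:
> {code:java}
> // Hypothetical sketch: each attempt owns its own AppSchedulingInfo object,
> // so a stale update from the old attempt cannot leak into the new attempt.
> SchedulerApplicationAttempt oldAttempt = ...; // attempt _000001
> SchedulerApplicationAttempt newAttempt = ...; // attempt _000002
> // Recorded only into oldAttempt's private AppSchedulingInfo:
> oldAttempt.updateResourceRequests(staleAsk);
> // newAttempt's AppSchedulingInfo is a different object and stays untouched.
> assert newAttempt.getAppSchedulingInfo() != oldAttempt.getAppSchedulingInfo();
> {code}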
> *Concerns:*
> The getApplicationAttempt function in AbstractYarnScheduler is confusing; we 
> had better rename it to getCurrentApplicationAttempt, and reconsider whether 
> there are any other bugs related to getApplicationAttempt.


