[ https://issues.apache.org/jira/browse/YARN-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116220#comment-16116220 ]

Yuqi Wang commented on YARN-6959:
---------------------------------

Here is the NM log for the issue:

application_1500967702061_2512 asked for 20GB for its AM Container and 5GB for 
its Task Containers:
{code:java}
2017-07-31 20:58:49,532 INFO [Container Monitor] org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree container_e71_1500967702061_2512_01_000001 for container-id container_e71_1500967702061_2512_01_000001: 307.8 MB of 20 GB physical memory used; 1.2 GB of 30 GB virtual memory used
{code}

After its first attempt failed, the second attempt was submitted; however, the 
NM mistakenly treated the new AM Container as 5GB (the Task Container size):
{code:java}
2017-07-31 21:29:46,219 INFO [Container Monitor] org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree container_e71_1500967702061_2512_02_000001 for container-id container_e71_1500967702061_2512_02_000001: 352.5 MB of 5 GB physical memory used; 1.4 GB of 7.5 GB virtual memory used
{code}

Here is the RM log. Note that the AM Container for the second attempt was 
allocated with <memory:5120> (the Task Container size), and the log also 
contains the InvalidStateTransitonException (Invalid event: CONTAINER_ALLOCATED 
at ALLOCATED_SAVING):

{code:java}
2017-07-31 21:29:38,510 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1500967702061_2512 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@57fbb4f5, leaf-queue: prod-new #user-pending-applications: 0 #user-active-applications: 6 #queue-pending-applications: 0 #queue-active-applications: 6
2017-07-31 21:29:38,510 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added Application Attempt appattempt_1500967702061_2512_000002 to scheduler from user hadoop in queue prod-new
2017-07-31 21:29:38,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1500967702061_2512_000002 State change from SUBMITTED to SCHEDULED
2017-07-31 21:29:38,517 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e71_1500967702061_2512_02_000001 Container Transitioned from NEW to ALLOCATED
2017-07-31 21:29:38,517 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop       OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1500967702061_2512    CONTAINERID=container_e71_1500967702061_2512_02_000001
2017-07-31 21:29:38,517 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1500967702061_2512_000002 container=Container: [ContainerId: container_e71_1500967702061_2512_02_000001, NodeId: BN2APS0A98AEA0:10025, NodeHttpAddress: Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN2APS0A98AEA0/8042, Resource: <memory:5120, vCores:1, ports:null>, Priority: 1, Token: null, ] queue=prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=<memory:0, vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, numContainers=8016 clusterResource=<memory:261614761, vCores:79088, ports:null> type=OFF_SWITCH
2017-07-31 21:29:38,517 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0, ports:null> cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : BN2APS0A98AEA0:10025 for container : container_e71_1500967702061_2512_02_000001
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e71_1500967702061_2512_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1500967702061_2512_000002
2017-07-31 21:29:38,517 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1500967702061_2512 AttemptId: appattempt_1500967702061_2512_000002 MasterContainer: Container: [ContainerId: container_e71_1500967702061_2512_02_000001, NodeId: BN2APS0A98AEA0:10025, NodeHttpAddress: Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN2APS0A98AEA0/8042, Resource: <memory:5120, vCores:1, ports:null>, Priority: 1, Token: Token { kind: ContainerToken, service: 10.152.174.160:10025 }, ]
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e71_1500967702061_2512_02_000002 Container Transitioned from NEW to ALLOCATED
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop       OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1500967702061_2512    CONTAINERID=container_e71_1500967702061_2512_02_000002
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1500967702061_2512_000002 container=Container: [ContainerId: container_e71_1500967702061_2512_02_000002, NodeId: BN1APS0A1BAECA:10025, NodeHttpAddress: Proxy5.Yarn-Prod-Bn2.BN2.ap.gbl:81/proxy/nodemanager/BN1APS0A1BAECA/8042, Resource: <memory:5120, vCores:1, ports:null>, Priority: 1, Token: null, ] queue=prod-new: capacity=0.7, absoluteCapacity=0.7, usedResources=<memory:0, vCores:0, ports:null>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=6, numContainers=8017 clusterResource=<memory:261614761, vCores:79088, ports:null> type=OFF_SWITCH
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0, ports:null> cluster=<memory:261614761, vCores:79088, ports:null>
2017-07-31 21:29:38,518 INFO [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Container container_e71_1500967702061_2512_01_001344 completed with event FINISHED
2017-07-31 21:29:38,518 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1500967702061_2512_000002 State change from SCHEDULED to ALLOCATED_SAVING
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e71_1500967702061_2512_02_000003 Container Transitioned from NEW to ALLOCATED
2017-07-31 21:29:38,518 INFO [Thread-13] org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop       OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1500967702061_2512    CONTAINERID=container_e71_1500967702061_2512_02_000003
2017-07-31 21:29:38,518 ERROR [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at ALLOCATED_SAVING
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:808)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:947)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:928)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
        at java.lang.Thread.run(Thread.java:745)
{code}


> RM may allocate wrong AM Container for new attempt
> --------------------------------------------------
>
>                 Key: YARN-6959
>                 URL: https://issues.apache.org/jira/browse/YARN-6959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>              Labels: patch
>             Fix For: 2.7.1, 3.0.0-alpha4
>
>
> *Issue Summary:*
> The previous attempt's ResourceRequests may be recorded into the current 
> attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse 
> the AM Container request and allocation for the current attempt.
> *Issue Pipeline:*
> {code:java}
> // Execute the precondition check for the incoming attempt id.
> ApplicationMasterService.allocate() ->
> scheduler.allocate(attemptId, ask, ...) ->
>
> // The precondition check for the attempt id may be outdated here, i.e.
> // currentAttempt may not be the attempt corresponding to attemptId;
> // for example, attemptId may refer to the previous attempt.
> currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
>
> // The previous attempt's ResourceRequests may be recorded into the
> // current attempt's ResourceRequests.
> currentAttempt.updateResourceRequests(ask) ->
>
> // RM may allocate the wrong AM Container for the current attempt, because
> // its ResourceRequests may come from the previous attempt (i.e. whatever
> // ResourceRequests the previous AM asked for), and there is no matching
> // logic between the original AM Container ResourceRequest and the
> // amContainerAllocation returned below.
> AMContainerAllocatedTransition.transition(...) ->
> amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
> {code}
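>
> For reference, the lookup behind this, AbstractYarnScheduler.getApplicationAttempt, 
> looks roughly like the following in 2.7.x (a paraphrased sketch, not a verbatim 
> copy). The attempt number inside the passed attemptId is never consulted, so an 
> id from the previous attempt silently resolves to the current attempt:
> {code:java}
> // Paraphrased from AbstractYarnScheduler (2.7.x): the map is keyed by
> // application id only, and the *current* attempt is always returned,
> // whatever attempt number the caller's attemptId carries.
> public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
>   SchedulerApplication<T> app =
>       applications.get(applicationAttemptId.getApplicationId());
>   return app == null ? null : app.getCurrentAppAttempt();
> }
> {code}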
> *Patch Correctness:*
> After this patch, RM will record ResourceRequests from different attempts 
> into different SchedulerApplicationAttempt.AppSchedulingInfo objects. So, 
> even if RM still records ResourceRequests from the old attempt at any time, 
> these ResourceRequests will be recorded in the old AppSchedulingInfo object 
> and will not impact the current attempt's resource requests and allocation.
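>
> A minimal sketch of the kind of guard that enforces this, assuming a check at 
> the top of the scheduler's allocate() (illustrative, not necessarily the exact 
> patch code):
> {code:java}
> // Hypothetical guard: reject asks whose attempt id is not the current
> // attempt, so a late allocate() from the previous attempt cannot pollute
> // the current attempt's AppSchedulingInfo.
> FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);
> if (application == null
>     || !application.getApplicationAttemptId().equals(applicationAttemptId)) {
>   LOG.error("Calling allocate on previous or removed application attempt "
>       + applicationAttemptId);
>   return EMPTY_ALLOCATION;
> }
> {code}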
> *Concerns:*
> The getApplicationAttempt method in AbstractYarnScheduler is confusing: it 
> always returns the application's current attempt, regardless of the attempt 
> id passed in. We should rename it to getCurrentApplicationAttempt, and 
> reconsider whether there are any other bugs related to getApplicationAttempt.
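>
> For example (an illustrative signature only, not a change made by this patch), 
> the rename would make the semantics explicit at every call site; since the 
> attempt number is ignored anyway, the method could even take an ApplicationId:
> {code:java}
> // Illustrative rename: the name now states that the application's current
> // attempt is returned, independent of any particular attempt number.
> public T getCurrentApplicationAttempt(ApplicationId applicationId) {
>   SchedulerApplication<T> app = applications.get(applicationId);
>   return app == null ? null : app.getCurrentAppAttempt();
> }
> {code}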


