[ https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924637#comment-15924637 ]

Jason Lowe commented on YARN-3884:
----------------------------------

If only nodemanagers are reporting, then allocations that are never launched 
would also be missed (i.e., the RM hands the AM a bunch of containers, but the 
AM sits on them for a few minutes and releases them without ever launching 
them).  App frameworks that perform container reuse will always have this to 
some degree, since a new allocation races with live containers finishing and 
being reused, which can eliminate the need for the allocation that just arrived.
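
A hedged sketch of that never-launched path, using the public AMRMClient API 
(the workAlreadyCovered() predicate is hypothetical, not something from this 
issue):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class NeverLaunchedSketch {
  // Hypothetical predicate: a reused live container finished the pending
  // work while this allocation was in flight.
  static boolean workAlreadyCovered() { return true; }

  public static void main(String[] args) throws Exception {
    // Sketch only: registering with the RM and issuing the container
    // requests themselves are omitted.
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new Configuration());
    amClient.start();

    AllocateResponse resp = amClient.allocate(0.0f);
    for (Container c : resp.getAllocatedContainers()) {
      if (workAlreadyCovered()) {
        // Released without ever launching: no NM ever starts this
        // container, so an NM-only reporter never sees its footprint.
        amClient.releaseAssignedContainer(c.getId());
      }
    }
  }
}
{code}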

This all comes down to what kind of view we're trying to capture.  If the user 
wants to see the impact on the physical nodes, then having only the nodes 
report makes sense.  If we need to capture the total footprint, including 
times when no physical container was running yet, then only the RM can report 
that picture.  However, I'm not sure the RM needs to report that total picture 
in terms of individual containers.  It could instead post periodic events 
reporting the aggregate footprint of the app (i.e., the same kind of metrics 
added by YARN-415).  We can grab the individual stats of containers that 
actually ran, so subtracting those from the aggregate footprint total gets us 
the aggregate "overhead" in terms of reservations and unlaunched container 
allocations.  Since we're reporting on the order of applications rather than 
containers (something I'd expect the RM to be doing anyway for other reasons), 
this seems like a reasonable load for the RM to bear and still gets us the 
rollup chargeback metrics.  Thoughts?
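
A minimal sketch of that subtraction, assuming the aggregate comes from the 
YARN-415 style memory-seconds in ApplicationResourceUsageReport and the 
launched-container records come from the per-container history store:

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;

public class OverheadSketch {
  // overhead = aggregate app footprint - footprint of containers that ran.
  // The aggregate includes reservations and never-launched allocations,
  // so the difference is the "overhead" footprint described above.
  static long overheadMemorySeconds(ApplicationResourceUsageReport appReport,
      List<ContainerReport> launched) {
    long launchedMemSecs = 0;
    for (ContainerReport c : launched) {
      long runSecs = Math.max(0, c.getFinishTime() - c.getCreationTime()) / 1000;
      launchedMemSecs += (long) c.getAllocatedResource().getMemory() * runSecs;
    }
    return appReport.getMemorySeconds() - launchedMemSecs;
  }
}
{code}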

> App History status not updated when RMContainer transitions from RESERVED to 
> KILLED
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3884
>                 URL: https://issues.apache.org/jira/browse/YARN-3884
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>         Environment: Suse11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>              Labels: oct16-easy
>         Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, 
> Elapsed Time.jpg, Test Result-Container status.jpg, YARN-3884.0002.patch, 
> YARN-3884.0003.patch, YARN-3884.0004.patch, YARN-3884.0005.patch, 
> YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===============
> 2 NMs with 3072 MB and 16 vCores each (cluster: <memory:6144, vCores:32>)
> Steps to reproduce
> ===============
> 1. Submit apps to Queue 1 with 512 MB and 1 core
> 2. Submit apps to Queue 2 with 512 MB and 5 cores
> Lots of containers get reserved and unreserved in this case:
> {code}
> 2015-07-02 20:45:31,169 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0002_01_000013 Container Transitioned from NEW to 
> RESERVED
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Reserved container  application=application_1435849994778_0002 
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>, 
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
> numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625 
> used=<memory:2560, vCores:21> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>, 
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
> numContainers=6
> 2015-07-02 20:45:31,170 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.96875 
> absoluteUsedCapacity=0.96875 used=<memory:5632, vCores:31> 
> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0001_01_000014 Container Transitioned from NEW to 
> ALLOCATED
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
> OPERATION=AM Allocated Container        TARGET=SchedulerApp     
> RESULT=SUCCESS  APPID=application_1435849994778_0001    
> CONTAINERID=container_e24_1435849994778_0001_01_000014
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
> Assigned container container_e24_1435849994778_0001_01_000014 of capacity 
> <memory:512, vCores:1> on host host-10-19-92-117:64318, which has 6 
> containers, <memory:3072, vCores:14> used and <memory:0, vCores:2> available 
> after allocation
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignedContainer application attempt=appattempt_1435849994778_0001_000001 
> container=Container: [ContainerId: 
> container_e24_1435849994778_0001_01_000014, NodeId: host-10-19-92-117:64318, 
> NodeHttpAddress: host-10-19-92-117:65321, Resource: <memory:512, vCores:1>, 
> Priority: 20, Token: null, ] queue=default: capacity=0.2, 
> absoluteCapacity=0.2, usedResources=<memory:2560, vCores:5>, 
> usedCapacity=2.0846906, absoluteUsedCapacity=0.41666666, numApps=1, 
> numContainers=5 clusterResource=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting assigned queue: root.default stats: default: capacity=0.2, 
> absoluteCapacity=0.2, usedResources=<memory:3072, vCores:6>, 
> usedCapacity=2.5016286, absoluteUsedCapacity=0.5, numApps=1, numContainers=6
> 2015-07-02 20:45:31,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0 
> used=<memory:6144, vCores:32> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,143 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0001_01_000014 Container Transitioned from 
> ALLOCATED to ACQUIRED
> 2015-07-02 20:45:32,174 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1435849994778_0002 
> on node: host-10-19-92-143:64318
> 2015-07-02 20:45:32,174 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Reserved container  application=application_1435849994778_0002 
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>, 
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
> numContainers=6 usedCapacity=2.0317461 absoluteUsedCapacity=0.8125 
> used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,174 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Skipping scheduling since node host-10-19-92-143:64318 is reserved by 
> application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:32,213 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e24_1435849994778_0001_01_000014 Container Transitioned from 
> ACQUIRED to RUNNING
> 2015-07-02 20:45:32,213 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Null container completed...
> 2015-07-02 20:45:33,178 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1435849994778_0002 
> on node: host-10-19-92-143:64318
> 2015-07-02 20:45:33,178 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Reserved container  application=application_1435849994778_0002 
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>, 
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
> numContainers=6 usedCapacity=2.0317461 absoluteUsedCapacity=0.8125 
> used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,178 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Skipping scheduling since node host-10-19-92-143:64318 is reserved by 
> application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:33,704 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Application application_1435849994778_0002 unreserved  on node host: 
> host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3> 
> used=<memory:2560, vCores:13>, currently has 0 at priority 20; 
> currentReservation <memory:0, vCores:0>
> 2015-07-02 20:45:33,704 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> QueueA used=<memory:2560, vCores:21> numContainers=5 user=dsperf 
> user-resources=<memory:2560, vCores:21>
> 2015-07-02 20:45:33,710 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> completedContainer container=Container: [ContainerId: 
> container_e24_1435849994778_0002_01_000013, NodeId: host-10-19-92-143:64318, 
> NodeHttpAddress: host-10-19-92-143:65321, Resource: <memory:512, vCores:5>, 
> Priority: 20, Token: null, ] queue=QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>, 
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
> numContainers=5 cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,710 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> completedContainer queue=root usedCapacity=0.9166667 
> absoluteUsedCapacity=0.9166667 used=<memory:5632, vCores:27> 
> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,711 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting completed queue: root.QueueA stats: QueueA: capacity=0.4, 
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>, 
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
> numContainers=5
> 2015-07-02 20:45:33,711 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1435849994778_0002_000001 released container 
> container_e24_1435849994778_0002_01_000013 on node: host: 
> host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3> 
> used=<memory:2560, vCores:13> with event: KILL
> {code}
> *Impact:*
> In the Application History Server the container status gets updated to -1000 
> (INVALID), but the end time is never set, so the Elapsed Time keeps changing.
> Please check the attached screenshots.
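
(For context, a minimal sketch of why the Elapsed Time keeps changing, 
assuming the usual convention that an unset finish time is reported as 0; 
this mirrors the behavior of org.apache.hadoop.yarn.util.Times.elapsed rather 
than quoting it:)

{code}
public class ElapsedSketch {
  // Illustrative only: with finishTime never set on the RESERVED -> KILLED
  // path (reported as 0), every refresh recomputes elapsed time against
  // "now", so the displayed value keeps growing.
  static long elapsed(long startTime, long finishTime) {
    long end = finishTime <= 0 ? System.currentTimeMillis() : finishTime;
    return end - startTime;
  }
}
{code}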


