[ https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924805#comment-15924805 ]
Varun Saxena commented on YARN-3884:
------------------------------------
bq. If only nodemanagers are reporting then allocations that are never launched
would also be missed (i.e.: RM hands the AM a bunch of containers, but the AM
sits on them for a few minutes and releases them without ever launching them).
App frameworks that perform container reuse will always have this to some
degree as the allocation races with live containers finishing and getting
reused, eliminating the need for the allocation that just arrived.
True. Allocations that are later released without ever being launched won't be
captured either.
bq. It could instead post periodic events reporting the aggregate footprint of
the app
This is currently done only when an app finishes. We grab the RMAppMetrics,
which in turn gets the app's resource usage from the scheduler.
IIRC, there was some discussion on doing it periodically. Let me find the JIRA.
bq. We can grab the individual stats of containers that actually ran, so
subtracting that from the aggregate footprint total gets us the aggregate
"overhead" in terms of reservations and unlaunched container allocations.
Should be doable.
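
The subtraction described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Usage` type, the `overhead` function, and the memory-seconds / vcore-seconds units are hypothetical names chosen for the example, not the actual YARN code, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Resource usage integrated over time (hypothetical units)."""
    memory_mb_seconds: int
    vcore_seconds: int

    def __add__(self, other):
        return Usage(self.memory_mb_seconds + other.memory_mb_seconds,
                     self.vcore_seconds + other.vcore_seconds)

    def __sub__(self, other):
        return Usage(self.memory_mb_seconds - other.memory_mb_seconds,
                     self.vcore_seconds - other.vcore_seconds)


def overhead(aggregate, launched):
    """Aggregate footprint minus the usage of containers that actually ran.

    What is left is attributed to reservations and to allocations that
    were released without ever being launched.
    """
    ran = Usage(0, 0)
    for usage in launched:
        ran = ran + usage
    return aggregate - ran


# Made-up numbers: the scheduler-side total for the app, and the
# per-container stats reported for the containers that were launched.
aggregate = Usage(memory_mb_seconds=10_000, vcore_seconds=40)
launched = [Usage(4_000, 15), Usage(3_500, 12)]
print(overhead(aggregate, launched))
```

In practice the two sides would be sampled at different times, so a small negative or noisy result is possible; the sketch does not try to handle that.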
> App History status not updated when RMContainer transitions from RESERVED to
> KILLED
> -----------------------------------------------------------------------------------
>
> Key: YARN-3884
> URL: https://issues.apache.org/jira/browse/YARN-3884
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Environment: Suse11 Sp3
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Labels: oct16-easy
> Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg,
> Elapsed Time.jpg, Test Result-Container status.jpg, YARN-3884.0002.patch,
> YARN-3884.0003.patch, YARN-3884.0004.patch, YARN-3884.0005.patch,
> YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===============
> 1 NM 3072 MB 16 cores each
> Steps to reproduce
> ===============
> 1. Submit apps to Queue 1 with 512 MB and 1 core
> 2. Submit apps to Queue 2 with 512 MB and 5 cores
> Lots of containers get reserved and unreserved in this case.
> {code}
> 2015-07-02 20:45:31,169 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e24_1435849994778_0002_01_000013 Container Transitioned from NEW to
> RESERVED
> 2015-07-02 20:45:31,170 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Reserved container application=application_1435849994778_0002
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>,
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1,
> numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625
> used=<memory:2560, vCores:21> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,170 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1,
> numContainers=6
> 2015-07-02 20:45:31,170 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> assignedContainer queue=root usedCapacity=0.96875
> absoluteUsedCapacity=0.96875 used=<memory:5632, vCores:31>
> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e24_1435849994778_0001_01_000014 Container Transitioned from NEW to
> ALLOCATED
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf
> OPERATION=AM Allocated Container TARGET=SchedulerApp
> RESULT=SUCCESS APPID=application_1435849994778_0001
> CONTAINERID=container_e24_1435849994778_0001_01_000014
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
> Assigned container container_e24_1435849994778_0001_01_000014 of capacity
> <memory:512, vCores:1> on host host-10-19-92-117:64318, which has 6
> containers, <memory:3072, vCores:14> used and <memory:0, vCores:2> available
> after allocation
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> assignedContainer application attempt=appattempt_1435849994778_0001_000001
> container=Container: [ContainerId:
> container_e24_1435849994778_0001_01_000014, NodeId: host-10-19-92-117:64318,
> NodeHttpAddress: host-10-19-92-117:65321, Resource: <memory:512, vCores:1>,
> Priority: 20, Token: null, ] queue=default: capacity=0.2,
> absoluteCapacity=0.2, usedResources=<memory:2560, vCores:5>,
> usedCapacity=2.0846906, absoluteUsedCapacity=0.41666666, numApps=1,
> numContainers=5 clusterResource=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> Re-sorting assigned queue: root.default stats: default: capacity=0.2,
> absoluteCapacity=0.2, usedResources=<memory:3072, vCores:6>,
> usedCapacity=2.5016286, absoluteUsedCapacity=0.5, numApps=1, numContainers=6
> 2015-07-02 20:45:31,191 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0
> used=<memory:6144, vCores:32> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,143 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e24_1435849994778_0001_01_000014 Container Transitioned from
> ALLOCATED to ACQUIRED
> 2015-07-02 20:45:32,174 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Trying to fulfill reservation for application application_1435849994778_0002
> on node: host-10-19-92-143:64318
> 2015-07-02 20:45:32,174 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Reserved container application=application_1435849994778_0002
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1,
> numContainers=6 usedCapacity=2.0317461 absoluteUsedCapacity=0.8125
> used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,174 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Skipping scheduling since node host-10-19-92-143:64318 is reserved by
> application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:32,213 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e24_1435849994778_0001_01_000014 Container Transitioned from
> ACQUIRED to RUNNING
> 2015-07-02 20:45:32,213 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Null container completed...
> 2015-07-02 20:45:33,178 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Trying to fulfill reservation for application application_1435849994778_0002
> on node: host-10-19-92-143:64318
> 2015-07-02 20:45:33,178 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Reserved container application=application_1435849994778_0002
> resource=<memory:512, vCores:5> queue=QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
> usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1,
> numContainers=6 usedCapacity=2.0317461 absoluteUsedCapacity=0.8125
> used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,178 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Skipping scheduling since node host-10-19-92-143:64318 is reserved by
> application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:33,704 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
> Application application_1435849994778_0002 unreserved on node host:
> host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3>
> used=<memory:2560, vCores:13>, currently has 0 at priority 20;
> currentReservation <memory:0, vCores:0>
> 2015-07-02 20:45:33,704 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> QueueA used=<memory:2560, vCores:21> numContainers=5 user=dsperf
> user-resources=<memory:2560, vCores:21>
> 2015-07-02 20:45:33,710 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> completedContainer container=Container: [ContainerId:
> container_e24_1435849994778_0002_01_000013, NodeId: host-10-19-92-143:64318,
> NodeHttpAddress: host-10-19-92-143:65321, Resource: <memory:512, vCores:5>,
> Priority: 20, Token: null, ] queue=QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>,
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1,
> numContainers=5 cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,710 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> completedContainer queue=root usedCapacity=0.9166667
> absoluteUsedCapacity=0.9166667 used=<memory:5632, vCores:27>
> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,711 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> Re-sorting completed queue: root.QueueA stats: QueueA: capacity=0.4,
> absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>,
> usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1,
> numContainers=5
> 2015-07-02 20:45:33,711 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Application attempt appattempt_1435849994778_0002_000001 released container
> container_e24_1435849994778_0002_01_000013 on node: host:
> host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3>
> used=<memory:2560, vCores:13> with event: KILL
> {code}
> *Impact:*
> In the application history server, the status gets updated to -1000 (INVALID),
> but the end time is not updated, so the Elapsed Time keeps changing.
> Please check the attached snapshot.
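
A minimal sketch of why a missing end time makes the Elapsed Time keep changing, assuming the history UI falls back to the current time when no finish time was recorded. The function name and the "0 means not set" convention are assumptions made for illustration, not the actual history server code.

```python
import time


def elapsed_ms(start_ms, finish_ms, now_ms=None):
    """Elapsed time as a UI might compute it (assumed behaviour)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Assumption: a finish time of 0 means it was never recorded,
    # so the current time is used as the end of the interval.
    end_ms = finish_ms if finish_ms > 0 else now_ms
    return end_ms - start_ms


# With no finish time, every refresh yields a larger value.
first_view = elapsed_ms(1_000, 0, now_ms=5_000)
later_view = elapsed_ms(1_000, 0, now_ms=9_000)

# With a recorded finish time, the value is stable regardless of "now".
fixed_view = elapsed_ms(1_000, 3_000, now_ms=9_000)
```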