[ 
https://issues.apache.org/jira/browse/YARN-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-7177:
-------------------------------
    Attachment: YARN-7177-branch-2.7.001.patch

Attached a patch that calculates queue statistics based on the 
default-node-label resource.

> AvailableMB, AvailableVCores in the QueueMetrics is not correct when there 
> are nodes whose node-label is not default
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7177
>                 URL: https://issues.apache.org/jira/browse/YARN-7177
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: kyungwan nam
>         Attachments: YARN-7177-branch-2.7.001.patch
>
>
> - default-node-label has total resource <memory:248832, vCores:144>
> - ‘label1’ node-label has total resource <memory:248832, vCores:144>
> - The ‘large’ and ‘small’ queues are each 50% of the 
> default-node-label capacity.
> - ‘label1’ queue is 100% of ‘label1’ node-label capacity.
> - An application using <memory:48128, vCores:13> is submitted to the 'small' queue.
> We can see that AvailableMB and AvailableVCores are not correct, as follows.
> {code}
> {
> name: "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=small",
> modelerType: "QueueMetrics,q0=root,q1=small",
> tag.Queue: "root.small",
> tag.Context: "yarn",
> tag.Hostname: "host1.com",
> running_0: 1,
> running_60: 0,
> running_300: 0,
> running_1440: 0,
> AppsSubmitted: 1,
> AppsRunning: 1,
> AppsPending: 0,
> AppsCompleted: 0,
> AppsKilled: 0,
> AppsFailed: 0,
> AllocatedMB: 48128,
> AllocatedVCores: 13,
> AllocatedContainers: 13,
> AggregateContainersAllocated: 17,
> AggregateContainersReleased: 4,
> AvailableMB: 200704,
> AvailableVCores: 131,
> PendingMB: 0,
> PendingVCores: 0,
> PendingContainers: 0,
> ReservedMB: 0,
> ReservedVCores: 0,
> ReservedContainers: 0,
> ActiveUsers: 0,
> ActiveApplications: 0
> },
> {code}
> I think it should be calculated based on default-node-label as follows.
> * AvailableMB = ( 248832 <default-node-label total resource> - 48128 <used 
> resource> ) * 0.5 <small queue capacity>
> * AvailableVCores = ( 144 <default-node-label total resource> - 13 <used 
> resource> ) * 0.5 <small queue capacity>
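As a rough check of the arithmetic in the proposed formula above, here is a minimal Java sketch. The class and method names are hypothetical and are not part of Hadoop. It only evaluates (partition total - used) * queue capacity with the numbers from this report, and it assumes nothing about how the result would be rounded.

```java
// Hypothetical helper, not Hadoop code: evaluates the formula proposed in the report.
final class QueueAvailableSketch {

    /** (default-node-label total - used) * queue capacity fraction. */
    static double proposedAvailable(long partitionTotal, long used, double queueCapacity) {
        return (partitionTotal - used) * queueCapacity;
    }

    public static void main(String[] args) {
        // Numbers taken from the report: default-node-label total and 'small' queue usage.
        double availableMb = proposedAvailable(248832, 48128, 0.5);
        double availableVCores = proposedAvailable(144, 13, 0.5);

        System.out.println("AvailableMB     = " + availableMb);     // 100352.0
        System.out.println("AvailableVCores = " + availableVCores); // 65.5
    }
}
```

With these inputs the formula gives 100352 MB and 65.5 vCores, compared with the reported 200704 and 131.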
> We can see another problem: absoluteUsedCapacity and usedCapacity are 
> not correct in the log.
> {code}
> 2017-09-07 16:21:06,058 INFO  capacity.LeafQueue 
> (LeafQueue.java:releaseResource(1762)) - small used=<memory:48128, vCores:13> 
> numContainers=13 user=test user-resources=<memory:48128, vCores:13>
> 2017-09-07 16:21:06,058 INFO  capacity.LeafQueue 
> (LeafQueue.java:completedContainer(1713)) - completedContainer 
> container=Container: [ContainerId: 
> container_e15_1504768325902_0001_01_000017, NodeId: host2.com:45454, 
> NodeHttpAddress: host2.com:8042, Resource: <memory:4096, vCores:1>, Priority: 
> 1073741826, Token: Token { kind: ContainerToken, service: 10.10.10.1:45454 }, 
> ] queue=small: capacity=0.5, absoluteCapacity=0.5, 
> usedResources=<memory:48128, vCores:13>, usedCapacity=0.19341564, 
> absoluteUsedCapacity=0.09670782, numApps=1, numContainers=13 
> cluster=<memory:497664, vCores:288>
> {code}
> Those are calculated based on the total resources for all node-labels.
> Likewise, they should be based on default-node-label, as follows.
> * usedCapacity = 48128 <used resource> / ( 248832 <default-node-label total 
> resource> * 0.5 <small queue capacity> ) = 0.38683127
> * absoluteUsedCapacity = 48128 <used resource> / 248832 <default-node-label 
> total resource> = 0.19341563
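The two ratios above can be compared with the values in the log using a small Java sketch. The class and method names are hypothetical and this is not the Hadoop implementation. It only shows that dividing by the all-label total (497664) gives roughly the logged values, while dividing by the default-node-label total (248832) gives the values proposed here.

```java
// Hypothetical helper, not Hadoop code: compares the two bases for the ratios.
final class UsedCapacitySketch {

    /** used / (total * queue capacity fraction). */
    static double usedCapacity(long used, long total, double queueCapacity) {
        return used / (total * queueCapacity);
    }

    /** used / total. */
    static double absoluteUsedCapacity(long used, long total) {
        return (double) used / total;
    }

    public static void main(String[] args) {
        long used = 48128;
        long allLabelsTotal = 497664;    // cluster=<memory:497664, ...> from the log
        long defaultLabelTotal = 248832; // default-node-label total from the report

        // Based on all node-labels: about 0.1934 and 0.0967, as seen in the log.
        System.out.println(usedCapacity(used, allLabelsTotal, 0.5));
        System.out.println(absoluteUsedCapacity(used, allLabelsTotal));

        // Based on default-node-label: about 0.3868 and 0.1934, as proposed.
        System.out.println(usedCapacity(used, defaultLabelTotal, 0.5));
        System.out.println(absoluteUsedCapacity(used, defaultLabelTotal));
    }
}
```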
> This is confusing, but that is not all. Because absoluteUsedCapacity is used in 
> ProportionalCapacityPreemptionPolicy, a wrong value can cause problems with 
> preemption.
> {code}
>   private TempQueue cloneQueues(CSQueue root, Resource clusterResources) {
>     TempQueue ret;
>     synchronized (root) {
>       String queueName = root.getQueueName();
>       float absUsed = root.getAbsoluteUsedCapacity();
>       float absCap = root.getAbsoluteCapacity();
>       float absMaxCap = root.getAbsoluteMaximumCapacity();
>       boolean preemptionDisabled = root.getPreemptionDisabled();
> {code}
> It seems that this problem does not occur in hadoop-2.8 or higher, 
> but we need to fix it for hadoop-2.7.


