[
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311957#comment-17311957
]
Anup Agarwal edited comment on YARN-10724 at 3/31/21, 1:31 AM:
---------------------------------------------------------------
BTW, I have tested these changes with the discrete-time simulator
(https://issues.apache.org/jira/browse/YARN-1187) but not on a real testbed,
so it would be good to get them reviewed.
In the patch, I am unsure about the following two things:
(1) To determine whether a container was preempted, LeafQueue originally
checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}},
while FSAppAttempt checks
{{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}}.
Ideally, I think the logical AND of these two conditions should be used.
(2) Someone may have deliberately logged preemption metrics in LeafQueue for a
reason I don't know about. That said, the change in the patch is consistent
with the code already in FSAppAttempt.
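To illustrate point (1), the combined condition could look like the sketch below. This is only a minimal, self-contained illustration, not the actual patch: {{EXIT_STATUS_PREEMPTED}} and {{PREEMPTED_DIAGNOSTICS}} are stand-in constants for {{ContainerExitStatus.PREEMPTED}} and {{SchedulerUtils.PREEMPTED_CONTAINER}}, and the exact values should be taken from the Hadoop source.

```java
// Minimal sketch of taking the logical AND of the two preemption checks.
// The constants below are stand-ins for ContainerExitStatus.PREEMPTED and
// SchedulerUtils.PREEMPTED_CONTAINER; verify their values against Hadoop.
public class PreemptionCheck {
    static final int EXIT_STATUS_PREEMPTED = -102;
    static final String PREEMPTED_DIAGNOSTICS = "Container preempted by scheduler";

    // True only when both the exit status and the diagnostics say "preempted",
    // i.e. the AND of the LeafQueue check and the FSAppAttempt check.
    static boolean wasPreempted(int exitStatus, String diagnostics) {
        return exitStatus == EXIT_STATUS_PREEMPTED
                && PREEMPTED_DIAGNOSTICS.equals(diagnostics);
    }
}
```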
> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> --------------------------------------------------------------------
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Anup Agarwal
> Assignee: Anup Agarwal
> Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside
> QueueMetrics.
>
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the
> container immediately from the launchedContainers list; it waits for the NM
> to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for
> containers to preempt based on the launchedContainers list. Both calls can
> create a ContainerPreemptEvent for the same container (since the RM is still
> waiting for the NM to kill it). This leads LeafQueue to log metrics for the
> same preemption multiple times.
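The double-signalling described above could be guarded with a simple de-duplication set keyed by container id. This is a hypothetical sketch of the idea, not the code in the attached patches; the class and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: count a container's preemption at most once, even if
// NODE_RESOURCE_UPDATE and NODE_UPDATE both raise a ContainerPreemptEvent
// for it while the RM waits for the NM to actually kill the container.
public class PreemptionDeduper {
    private final Set<String> signalled = new HashSet<>();

    // Returns true only the first time a container id is seen, so only the
    // first ContainerPreemptEvent for it should update the queue metrics.
    public boolean shouldCountPreemption(String containerId) {
        return signalled.add(containerId);
    }
}
```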
--
This message was sent by Atlassian Jira
(v8.3.4#803005)