[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

Anup Agarwal (Jira) Thu, 01 Apr 2021 11:21:07 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313363#comment-17313363
 ]


Anup Agarwal edited comment on YARN-10724 at 4/1/21, 6:20 PM:
--------------------------------------------------------------

Calling completedContainer multiple times may or may not be an issue in itself, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate container 
completion (including preemption) events, whereas LeafQueue does no such 
deduplication. That is why the patch moves the preemption logging from LeafQueue 
to the AppAttempt, similar to FSAppAttempt.
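
The deduplication described above can be sketched roughly as follows. This is an illustrative toy, not the Hadoop code or the attached patch: QueueSketch, AppAttemptSketch, containerLaunched and containerCompleted are hypothetical names, and a plain Set stands in for the liveContainers collection.

```java
import java.util.HashSet;
import java.util.Set;

public class PreemptionDedupSketch {

    // Hypothetical stand-in for a queue that counts every event it is handed.
    static class QueueSketch {
        int preemptionsLogged = 0;

        void completedContainer(String containerId) {
            // No record of which containers were already handled,
            // so a repeated event is counted again.
            preemptionsLogged++;
        }
    }

    // Hypothetical stand-in for an app attempt that tracks live containers.
    static class AppAttemptSketch {
        private final Set<String> liveContainers = new HashSet<>();
        int preemptionsLogged = 0;

        void containerLaunched(String containerId) {
            liveContainers.add(containerId);
        }

        // Returns true only the first time a live container completes.
        boolean containerCompleted(String containerId) {
            if (!liveContainers.remove(containerId)) {
                return false; // duplicate event: container is no longer live
            }
            preemptionsLogged++;
            return true;
        }
    }

    public static void main(String[] args) {
        QueueSketch queue = new QueueSketch();
        AppAttemptSketch attempt = new AppAttemptSketch();
        attempt.containerLaunched("container_1");

        // Deliver the same preemption event twice.
        for (int i = 0; i < 2; i++) {
            queue.completedContainer("container_1");
            attempt.containerCompleted("container_1");
        }

        System.out.println("queue count:   " + queue.preemptionsLogged);   // 2
        System.out.println("attempt count: " + attempt.preemptionsLogged); // 1
    }
}
```

In this sketch the guard is simply the return value of Set.remove: only the first completion finds the container in the live set, so only the first one is logged.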


was (Author: 108anup):
completedContainer getting called multiple times may or may not be an issue, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate preemption 
events; while leafQueue does no such thing, that's why the patch moved the 
preemption logging to AppAttempt rather than leafQueue, similar to FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> --------------------------------------------------------------------
>
>                 Key: YARN-10724
>                 URL: https://issues.apache.org/jira/browse/YARN-10724
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Anup Agarwal
>            Assignee: Anup Agarwal
>            Priority: Minor
>         Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container from its launchedContainers list immediately; it waits for the NM 
> to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both of these 
> calls can create a ContainerPreemptEvent for the same container (as the RM is 
> still waiting for the NM to kill the container). This leads LeafQueue to log 
> metrics for the same preemption multiple times.
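
The sequence in the description can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not the scheduler's real logic: NodeSketch, nmConfirmedKill and signalIfOvercommitted are hypothetical names, and every launched container is treated as a preemption victim.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicatePreemptEventSketch {

    // Hypothetical stand-in for a node whose launched list only shrinks
    // once the NM has confirmed that the container was killed.
    static class NodeSketch {
        final Set<String> launchedContainers = new LinkedHashSet<>();

        void nmConfirmedKill(String containerId) {
            launchedContainers.remove(containerId);
        }
    }

    // Hypothetical stand-in for the overcommit check. Assumption: every
    // launched container is a victim. It emits one event per victim and
    // does not modify the launched list.
    static List<String> signalIfOvercommitted(NodeSketch node) {
        return new ArrayList<>(node.launchedContainers);
    }

    public static void main(String[] args) {
        NodeSketch node = new NodeSketch();
        node.launchedContainers.add("container_1");

        List<String> preemptEvents = new ArrayList<>();
        // Triggered by a resource update on the node.
        preemptEvents.addAll(signalIfOvercommitted(node));
        // Triggered by a heartbeat before the NM has killed the container.
        preemptEvents.addAll(signalIfOvercommitted(node));

        // Only now does the NM report the kill.
        node.nmConfirmedKill("container_1");
        preemptEvents.addAll(signalIfOvercommitted(node)); // adds nothing

        System.out.println(preemptEvents); // [container_1, container_1]
    }
}
```

A consumer that counts one preemption per event would record two preemptions here, although only one container was preempted.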



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
