[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441828#comment-17441828 ]
Andrew Chung commented on YARN-10760:
-------------------------------------
[~inigoiri] I've created a PR for this issue, please have a look, thanks!
> Number of allocated OPPORTUNISTIC containers can dip below 0
> ------------------------------------------------------------
>
> Key: YARN-10760
> URL: https://issues.apache.org/jira/browse/YARN-10760
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Andrew Chung
> Assignee: Andrew Chung
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{AbstractYarnScheduler.completedContainers}} can potentially be called from
> multiple sources, yet it appears that there are scenarios in which the caller
> does not hold the appropriate lock, which can lead to the count of
> {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
> To prevent double counting when releasing allocated O containers, a simple
> fix might be to check if the {{RMContainer}} has already been removed
> beforehand, though that may not fix the underlying issue that causes the race
> condition.
> The following is a capture of
> {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0,
> obtained via a JMX query:
> {noformat}
> {
> "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
> "modelerType" : "OpportunisticSchedulerMetrics",
> "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
> "tag.Context" : "yarn",
> "tag.Hostname" : "",
> "AllocatedOContainers" : -2716,
> "AggregateOContainersAllocated" : 306020,
> "AggregateOContainersReleased" : 308736,
> "AggregateNodeLocalOContainersAllocated" : 0,
> "AggregateRackLocalOContainersAllocated" : 0,
> "AggregateOffSwitchOContainersAllocated" : 306020,
> "AllocateLatencyOQuantilesNumOps" : 0,
> "AllocateLatencyOQuantiles50thPercentileTime" : 0,
> "AllocateLatencyOQuantiles75thPercentileTime" : 0,
> "AllocateLatencyOQuantiles90thPercentileTime" : 0,
> "AllocateLatencyOQuantiles95thPercentileTime" : 0,
> "AllocateLatencyOQuantiles99thPercentileTime" : 0
> }
> {noformat}
> UPDATE: Upon further investigation, the culprit appears to be that
> {{AllocatedOContainers}} is not incremented for OPPORTUNISTIC containers
> recovered when the RM restarts, while releasing those recovered containers
> still decrements it. We have an initial fix and are waiting for it to be
> verified.
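
The mechanism described in the UPDATE can be illustrated with a minimal Java
sketch. This is not the actual YARN code or the fix in the PR: the class
OContainerGaugeSketch, its methods, and the countRecovered flag are all
hypothetical names made up for illustration. It assumes only what the
description states: recovered containers are decremented on release, and the
gauge goes negative unless recovery also increments it. The release path
includes the "already removed" guard suggested in the description, to show
that the guard alone does not prevent the negative count.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the AllocatedOContainers gauge; not the real YARN class.
final class OContainerGaugeSketch {
    private final Set<String> tracked = new HashSet<>();
    private final boolean countRecovered; // assumption: true models the fixed behavior
    private int allocated;

    OContainerGaugeSketch(boolean countRecovered) {
        this.countRecovered = countRecovered;
    }

    // Normal allocation path: track the container and bump the gauge.
    void allocate(String containerId) {
        if (tracked.add(containerId)) {
            allocated++;
        }
    }

    // Recovery after a restart: the container is tracked again, but the
    // gauge is only incremented when countRecovered is true.
    void recover(String containerId) {
        boolean added = tracked.add(containerId);
        if (added && countRecovered) {
            allocated++;
        }
    }

    // Release path with a guard: decrement only if the container was still tracked.
    void release(String containerId) {
        if (tracked.remove(containerId)) {
            allocated--;
        }
    }

    int allocated() {
        return allocated;
    }

    public static void main(String[] args) {
        for (boolean countRecovered : new boolean[] {false, true}) {
            OContainerGaugeSketch gauge = new OContainerGaugeSketch(countRecovered);
            gauge.recover("container_1");
            gauge.recover("container_2");
            gauge.release("container_1");
            gauge.release("container_1"); // duplicate release, ignored by the guard
            gauge.release("container_2");
            System.out.println("countRecovered=" + countRecovered
                + " allocated=" + gauge.allocated());
        }
    }
}
```

With countRecovered=false the sketch ends at -2 even though the duplicate
release is ignored; with countRecovered=true it ends at 0. The JMX capture
above is consistent with this kind of imbalance: 306020 allocated minus
308736 released gives -2716.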
--
This message was sent by Atlassian Jira
(v8.20.1#820001)