[ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448261#comment-17448261 ]

Íñigo Goiri commented on YARN-10760:
------------------------------------

Thanks [~afchung90] for the PR.
Merged it to trunk.

> Number of allocated OPPORTUNISTIC containers can dip below 0
> ------------------------------------------------------------
>
>                 Key: YARN-10760
>                 URL: https://issues.apache.org/jira/browse/YARN-10760
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Andrew Chung
>            Assignee: Andrew Chung
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {{AbstractYarnScheduler.completedContainers}} can potentially be called from 
> multiple sources, yet it appears that there are scenarios in which the caller 
> does not hold the appropriate lock, which can lead to the count of 
> {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
> To prevent double counting when releasing allocated O containers, a simple 
> fix might be to check beforehand whether the {{RMContainer}} has already 
> been removed, though that may not fix the underlying issue that causes the 
> race condition.
> The following is a capture, taken via a JMX query, of 
> {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0:
> {noformat}
> {
>     "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
>     "modelerType" : "OpportunisticSchedulerMetrics",
>     "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
>     "tag.Context" : "yarn",
>     "tag.Hostname" : "",
>     "AllocatedOContainers" : -2716,
>     "AggregateOContainersAllocated" : 306020,
>     "AggregateOContainersReleased" : 308736,
>     "AggregateNodeLocalOContainersAllocated" : 0,
>     "AggregateRackLocalOContainersAllocated" : 0,
>     "AggregateOffSwitchOContainersAllocated" : 306020,
>     "AllocateLatencyOQuantilesNumOps" : 0,
>     "AllocateLatencyOQuantiles50thPercentileTime" : 0,
>     "AllocateLatencyOQuantiles75thPercentileTime" : 0,
>     "AllocateLatencyOQuantiles90thPercentileTime" : 0,
>     "AllocateLatencyOQuantiles95thPercentileTime" : 0,
>     "AllocateLatencyOQuantiles99thPercentileTime" : 0
>   }
> {noformat}
> UPDATE: Upon further investigation, it seems the culprit is that we do not 
> increment AllocatedOContainers when the RM restarts. Recovered O containers 
> are never counted on recovery, yet their deallocation still decrements the 
> count. We have an initial fix for this and are waiting for it to be 
> verified.
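
The accounting problem described in the UPDATE can be illustrated with a small toy model. In the capture above the numbers are consistent with AllocatedOContainers = AggregateOContainersAllocated - AggregateOContainersReleased (306020 - 308736 = -2716), so any release that was never matched by an allocation pushes the count below 0. The sketch below is not the actual {{OpportunisticSchedulerMetrics}} code or the merged patch: the class {{OContainerGaugeSketch}}, its methods, and the {{countRecovered}} flag are hypothetical names, and having recovery also bump the aggregate counter is a modeling assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the O-container counters; hypothetical, not Hadoop code. */
public class OContainerGaugeSketch {
    private final AtomicLong allocated = new AtomicLong();
    private final AtomicLong aggregateAllocated = new AtomicLong();
    private final AtomicLong aggregateReleased = new AtomicLong();
    // Assumption: true models the proposed fix (count containers on recovery).
    private final boolean countRecovered;

    public OContainerGaugeSketch(boolean countRecovered) {
        this.countRecovered = countRecovered;
    }

    /** A container recovered after an RM restart. */
    public void recover() {
        if (countRecovered) {
            allocated.incrementAndGet();
            aggregateAllocated.incrementAndGet();
        }
        // Without the fix, recovery leaves every counter untouched.
    }

    /** Releasing always decrements, whether or not the container was counted. */
    public void release() {
        allocated.decrementAndGet();
        aggregateReleased.incrementAndGet();
    }

    public long allocated() {
        return allocated.get();
    }

    /** Fresh counters (as after a restart), then recover and release n containers. */
    public static OContainerGaugeSketch simulateRestart(boolean countRecovered, int n) {
        OContainerGaugeSketch metrics = new OContainerGaugeSketch(countRecovered);
        for (int i = 0; i < n; i++) {
            metrics.recover();
        }
        for (int i = 0; i < n; i++) {
            metrics.release();
        }
        return metrics;
    }

    public static void main(String[] args) {
        System.out.println("without recovery accounting: "
            + simulateRestart(false, 3).allocated());
        System.out.println("with recovery accounting: "
            + simulateRestart(true, 3).allocated());
    }
}
```

With three recovered containers, the model without recovery accounting ends at -3, while the model that counts recovered containers returns to 0. This mirrors the behavior reported in the issue, but only as an illustration.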



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
