[ https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036345#comment-15036345 ]
Robert Kanter commented on YARN-4408:
-------------------------------------

I haven't been able to reproduce this issue, and I agree that it's not a common occurrence, but we have seen the number of running containers go negative internally on two different clusters and also on a customer's cluster. So I went through the code and the state machine to see how we could decrement the gauge without first incrementing it. As far as I can tell, this is the only way this can happen, because we don't check {{container.wasLaunched}} here as we do in the other two places where we decrement the gauge.

> NodeManager still reports negative running containers
> -----------------------------------------------------
>
>                 Key: YARN-4408
>                 URL: https://issues.apache.org/jira/browse/YARN-4408
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: YARN-4408.001.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a
> negative number of running containers. However, it missed a rare case where
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge
> if we actually ran the container and incremented the gauge. However, this
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to
> {{DONE}}.
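
To make the unguarded decrement described above concrete, here is a minimal, self-contained sketch. It is not the actual NodeManager code or the attached patch: the class and method names ({{GaugeSketch}}, {{Gauge}}, {{Container}}, {{simulate}}) are hypothetical, only a few of the states are modeled, and it assumes a container can reach {{EXITED_WITH_SUCCESS}} without going through the {{LOCALIZED}} to {{RUNNING}} transition that increments the gauge (the exact path is not spelled out here).

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified model; not the real NodeManager classes. */
class GaugeSketch {
    enum State { LOCALIZED, RUNNING, EXITED_WITH_SUCCESS, DONE }

    /** Stand-in for the running-containers gauge. */
    static final class Gauge {
        private final AtomicInteger running = new AtomicInteger();
        void increment() { running.incrementAndGet(); }
        void decrement() { running.decrementAndGet(); }
        int value() { return running.get(); }
    }

    static final class Container {
        private final Gauge gauge;
        private State state = State.LOCALIZED;
        private boolean wasLaunched = false;

        Container(Gauge gauge) { this.gauge = gauge; }

        /** LOCALIZED -> RUNNING: the only place the gauge is incremented. */
        void launch() {
            state = State.RUNNING;
            wasLaunched = true;
            gauge.increment();
        }

        /** Assumption: this state can be reached even if launch() never ran. */
        void exitWithSuccess() { state = State.EXITED_WITH_SUCCESS; }

        /** EXITED_WITH_SUCCESS -> DONE, with or without the wasLaunched check. */
        void done(boolean checkWasLaunched) {
            if (!checkWasLaunched || wasLaunched) {
                gauge.decrement();
            }
            state = State.DONE;
        }
    }

    /** Runs one container lifecycle and returns the final gauge value. */
    static int simulate(boolean launched, boolean checkWasLaunched) {
        Gauge gauge = new Gauge();
        Container c = new Container(gauge);
        if (launched) {
            c.launch();
        }
        c.exitWithSuccess();
        c.done(checkWasLaunched);
        return gauge.value();
    }

    public static void main(String[] args) {
        System.out.println("unchecked, never launched: " + simulate(false, false));
        System.out.println("checked, never launched:   " + simulate(false, true));
        System.out.println("checked, launched:         " + simulate(true, true));
    }
}
```

In this sketch, only the unchecked transition on a never-launched container drives the gauge below zero; with the flag check in place, the gauge ends at zero whether or not the container was launched.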