Yang Wang created YARN-6966:
-------------------------------

             Summary: NodeManager metrics may return wrong negative values after restart
                 Key: YARN-6966
                 URL: https://issues.apache.org/jira/browse/YARN-6966
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Yang Wang

This is similar to YARN-6212. However, I think it is not a duplicate of YARN-3933. The primary cause of the negative values is that the metrics are not recovered properly when the NM restarts.

AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, AllocatedVCores, and AvailableVCores also need to be recovered when the NM restarts. This should be done in ContainerManagerImpl#recoverContainer.

The scenario can be reproduced with the following steps:
# Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true and YarnConfiguration.NM_RECOVERY_SUPERVISED=true in the NM
# Submit an application and keep it running
# Restart the NM
# Stop the application
# Now you get the negative values

{code}
/jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
{code}

{code}
{
  name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
  modelerType: "NodeManagerMetrics",
  tag.Context: "yarn",
  tag.Hostname: "hadoop1111.com",
  ContainersLaunched: 0,
  ContainersCompleted: 0,
  ContainersFailed: 2,
  ContainersKilled: 0,
  ContainersIniting: 0,
  ContainersRunning: 0,
  AllocatedGB: 0,
  AllocatedContainers: -2,
  AvailableGB: 160,
  AllocatedVCores: -11,
  AvailableVCores: 3611,
  ContainerLaunchDurationNumOps: 2,
  ContainerLaunchDurationAvgTime: 6,
  BadLocalDirs: 0,
  BadLogDirs: 0,
  GoodLocalDirsDiskUtilizationPerc: 2,
  GoodLogDirsDiskUtilizationPerc: 2
}
{code}
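
To make the failure mode concrete, here is a minimal, self-contained sketch of how the counters can go negative. It does not use the real Hadoop classes: SketchMetrics, simulateRestart, and the recoverMetrics flag are hypothetical names for illustration only. The total of 3600 vcores and the two containers of 1 and 10 vcores are assumptions, chosen only so that the result lines up with the AllocatedContainers, AllocatedVCores, and AvailableVCores values in the dump above.

```java
public class MetricsRecoverySketch {

    /** Hypothetical stand-in for the NM metrics source; not the real Hadoop class. */
    static final class SketchMetrics {
        int allocatedContainers;
        int allocatedVCores;
        int availableVCores;

        SketchMetrics(int totalVCores) {
            this.availableVCores = totalVCores;
        }

        void allocateContainer(int vcores) {
            allocatedContainers++;
            allocatedVCores += vcores;
            availableVCores -= vcores;
        }

        void releaseContainer(int vcores) {
            allocatedContainers--;
            allocatedVCores -= vcores;
            availableVCores += vcores;
        }
    }

    /**
     * Models an NM restart while containers are still running, followed by
     * the application being stopped. The metrics object is created fresh,
     * as it would be in a newly started process.
     */
    static SketchMetrics simulateRestart(int totalVCores,
                                         int[] runningContainerVCores,
                                         boolean recoverMetrics) {
        SketchMetrics metrics = new SketchMetrics(totalVCores);

        // Assumed fix: re-apply the allocation for every recovered container.
        if (recoverMetrics) {
            for (int vcores : runningContainerVCores) {
                metrics.allocateContainer(vcores);
            }
        }

        // Stopping the application releases every recovered container.
        for (int vcores : runningContainerVCores) {
            metrics.releaseContainer(vcores);
        }
        return metrics;
    }

    public static void main(String[] args) {
        int[] running = {1, 10}; // assumed vcores per container

        SketchMetrics broken = simulateRestart(3600, running, false);
        // prints "without recovery: -2 -11 3611"
        System.out.println("without recovery: " + broken.allocatedContainers
                + " " + broken.allocatedVCores + " " + broken.availableVCores);

        SketchMetrics fixed = simulateRestart(3600, running, true);
        // prints "with recovery: 0 0 3600"
        System.out.println("with recovery: " + fixed.allocatedContainers
                + " " + fixed.allocatedVCores + " " + fixed.availableVCores);
    }
}
```

In this model the release path runs for containers whose allocation was never counted by the new process, so each counter ends up off by exactly the recovered amount. Re-applying the allocation during recovery restores the balance, which is the behaviour proposed for ContainerManagerImpl#recoverContainer.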