Junping Du created YARN-5190:
--------------------------------
Summary: Race condition in registering container metrics cause
uncaught exception in ContainerMonitorImpl
Key: YARN-5190
URL: https://issues.apache.org/jira/browse/YARN-5190
Project: Hadoop YARN
Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
The exception stack trace is as follows:
{noformat}
2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010 already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
{noformat}
After YARN-4906, ContainerMetrics for a particular container can be obtained
from multiple places, so different threads can race to register the same
container metrics with DefaultMetricsSystem. The resulting MetricsException is
not handled properly, so it could bring down the ContainersMonitorImpl
monitoring thread or even the whole NodeManager (NM).
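
One possible way to avoid this kind of race, sketched below under stated assumptions, is to make the lookup-or-register step atomic. This is not the actual patch for this issue and does not use the Hadoop metrics2 API: StrictRegistry, Metrics, and ContainerMetricsRaceSketch are hypothetical stand-ins, with StrictRegistry only imitating a metrics system that rejects a duplicate source name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ContainerMetricsRaceSketch {

    /** Hypothetical stand-in for a metrics system that rejects duplicate names. */
    static final class StrictRegistry {
        private final ConcurrentHashMap<String, Object> sources = new ConcurrentHashMap<>();
        final AtomicInteger registrations = new AtomicInteger();

        void register(String name, Object source) {
            // Mirrors the "already exists!" failure seen in the stack trace.
            if (sources.putIfAbsent(name, source) != null) {
                throw new IllegalStateException("Metrics source " + name + " already exists!");
            }
            registrations.incrementAndGet();
        }
    }

    /** Hypothetical per-container metrics object. */
    static final class Metrics {
        final String containerId;

        Metrics(String containerId) {
            this.containerId = containerId;
        }
    }

    private final StrictRegistry registry = new StrictRegistry();
    private final ConcurrentHashMap<String, Metrics> cache = new ConcurrentHashMap<>();

    /**
     * Atomic lookup-or-create. ConcurrentHashMap.computeIfAbsent applies the
     * mapping function at most once per key, so only one thread registers.
     */
    Metrics forContainer(String containerId) {
        return cache.computeIfAbsent(containerId, id -> {
            Metrics m = new Metrics(id);
            registry.register("ContainerResource_" + id, m);
            return m;
        });
    }

    int registrationCount() {
        return registry.registrations.get();
    }

    public static void main(String[] args) throws Exception {
        ContainerMetricsRaceSketch sketch = new ContainerMetricsRaceSketch();
        int threads = 8;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                start.await();                        // release all threads together
                return sketch.forContainer("container_example_01");
            });
        }
        start.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("registrations = " + sketch.registrationCount());
    }
}
```

With a non-atomic "check the cache, then register" sequence, two threads could both miss the cache and both call register, and the second call would throw. In this sketch every caller receives the same Metrics instance and registration happens once.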