[jira] [Commented] (YARN-5190) Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent are conflict which cause uncaught exception in ContainerMonitorImpl

Jian He (JIRA) Thu, 02 Jun 2016 11:40:10 -0700

    [ https://issues.apache.org/jira/browse/YARN-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312850#comment-15312850 ]


Jian He commented on YARN-5190:
-------------------------------

Looks good; only a minor comment on the format. Would the version below be
slightly better? It avoids a couple of null checks.
{code}
     ContainerId containerId = monitoringEvent.getContainerId();
-    ContainerMetrics usageMetrics = ContainerMetrics
-        .forContainer(containerId, containerMetricsPeriodMs,
-        containerMetricsUnregisterDelayMs);
+    ContainerMetrics usageMetrics;

     int vmemLimitMBs;
     int pmemLimitMBs;
     int cpuVcores;
     switch (monitoringEvent.getType()) {
     case START_MONITORING_CONTAINER:
+      usageMetrics = ContainerMetrics
+          .forContainer(containerId, containerMetricsPeriodMs,
+              containerMetricsUnregisterDelayMs);
       ContainerStartMonitoringEvent startEvent =
           (ContainerStartMonitoringEvent) monitoringEvent;
       usageMetrics.recordStateChangeDurations(
@@ -640,9 +642,16 @@ private void updateContainerMetrics(ContainersMonitorEvent monitoringEvent) {
           vmemLimitMBs, pmemLimitMBs, cpuVcores);
       break;
     case STOP_MONITORING_CONTAINER:
-      usageMetrics.finished();
+      usageMetrics = ContainerMetrics.getContainerMetrics(
+          containerId);
+      if (usageMetrics != null) {
+        usageMetrics.finished();
+      }
       break;
     case CHANGE_MONITORING_CONTAINER_RESOURCE:
+      usageMetrics = ContainerMetrics
+          .forContainer(containerId, containerMetricsPeriodMs,
+              containerMetricsUnregisterDelayMs);
{code}
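
The shape of the suggestion above can be sketched in isolation. This is only a simplified illustration, not the YARN-5190 patch: MonitorEventSketch, its EventType enum and the nested Metrics class are hypothetical stand-ins for ContainersMonitorImpl, ContainersMonitorEvent types and ContainerMetrics, and the registry is assumed to be a plain map. The point it shows is that start and resource-change events use a get-or-create lookup, while a stop event uses a lookup-only call and tolerates a missing entry, so stopping never registers a new source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MonitorEventSketch {
    /** Hypothetical stand-in for the three monitoring event types in the diff. */
    public enum EventType { START, STOP, CHANGE_RESOURCE }

    /** Hypothetical stand-in for ContainerMetrics. */
    public static final class Metrics {
        public final String containerId;
        public boolean finished;

        Metrics(String containerId) {
            this.containerId = containerId;
        }

        void finished() {
            finished = true;
        }
    }

    // Assumption: metrics are tracked in a simple map keyed by container id.
    private final Map<String, Metrics> registry = new ConcurrentHashMap<>();

    /** Get-or-create, analogous in spirit to forContainer(...) in the diff. */
    public Metrics forContainer(String id) {
        return registry.computeIfAbsent(id, Metrics::new);
    }

    /** Lookup only; returns null when nothing was registered for this id. */
    public Metrics getContainerMetrics(String id) {
        return registry.get(id);
    }

    public void handle(EventType type, String id) {
        Metrics m;
        switch (type) {
        case START:
        case CHANGE_RESOURCE:
            // Creating metrics is only appropriate when monitoring starts or changes.
            m = forContainer(id);
            break;
        case STOP:
            // Never create metrics just to finish them; skip if absent.
            m = getContainerMetrics(id);
            if (m != null) {
                m.finished();
            }
            break;
        }
    }

    public int registeredCount() {
        return registry.size();
    }

    public static void main(String[] args) {
        MonitorEventSketch sketch = new MonitorEventSketch();
        sketch.handle(EventType.STOP, "container_a");   // no entry is created
        sketch.handle(EventType.START, "container_b");  // entry is created
        sketch.handle(EventType.STOP, "container_b");   // existing entry is finished
        System.out.println("registered sources: " + sketch.registeredCount());
    }
}
```

In this sketch the null check lives in exactly one branch, which is the readability gain the comment is after.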

> Registering/unregistering container metrics triggered by ContainerEvent and 
> ContainersMonitorEvent are conflict which cause uncaught exception in 
> ContainerMonitorImpl
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5190
>                 URL: https://issues.apache.org/jira/browse/YARN-5190
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-5190.patch
>
>
> The exception stack is as follows:
> {noformat}
> 310735 2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Exception.
> 310736 org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010 already exists!
> 310737         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
> 310738         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
> 310739         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
> 310740         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
> 310741         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
> 310742         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
> {noformat}
> After YARN-4906, there are multiple places that get the ContainerMetrics for
> a particular container, which can cause a race condition when different
> threads register the same container metrics with DefaultMetricsSystem.
> Because the MetricsException that can be thrown is not handled properly, it
> can bring down the ContainersMonitorImpl daemon thread or even the whole NM.
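
The race described above can be illustrated with a small self-contained sketch. Everything in it is hypothetical: StrictMetricsSystem only mimics a registry that rejects duplicate source names (as the "already exists!" message in the trace suggests), and RegistrationRaceSketch is not Hadoop code. It shows one common way to make get-or-create registration safe, assuming the per-container lookup can go through ConcurrentHashMap.computeIfAbsent, which applies the creating function at most once per key even when several threads ask for the same container at the same time. Whether the actual fix takes this form is not stated in this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class RegistrationRaceSketch {
    /** Hypothetical registry that rejects duplicate names, like the trace above. */
    public static final class StrictMetricsSystem {
        private final Map<String, Object> sources = new ConcurrentHashMap<>();
        public final AtomicInteger registrations = new AtomicInteger();

        public void register(String name, Object source) {
            if (sources.putIfAbsent(name, source) != null) {
                throw new IllegalStateException(
                    "Metrics source " + name + " already exists!");
            }
            registrations.incrementAndGet();
        }
    }

    private final StrictMetricsSystem system = new StrictMetricsSystem();
    private final Map<String, Object> byContainer = new ConcurrentHashMap<>();

    /** Atomic get-or-create: register(...) runs at most once per container id. */
    public Object forContainer(String containerId) {
        return byContainer.computeIfAbsent(containerId, id -> {
            Object metrics = new Object();
            system.register("ContainerResource_" + id, metrics);
            return metrics;
        });
    }

    /** Calls forContainer for one id from several threads; returns the registration count. */
    public static int registrationsAfterConcurrentCalls(int threads) {
        RegistrationRaceSketch sketch = new RegistrationRaceSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers together to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                sketch.forContainer("container_0001");
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sketch.system.registrations.get();
    }

    public static void main(String[] args) {
        System.out.println("registrations: " + registrationsAfterConcurrentCalls(8));
    }
}
```

By contrast, a check-then-register sequence without any atomic step lets two threads both see "not registered" and both call register, which is the kind of interleaving that would produce the exception shown in the description.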



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
