[jira] [Comment Edited] (YARN-5190) Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent are conflict which cause uncaught exception in ContainerMonitorImpl

Junping Du (JIRA) Thu, 02 Jun 2016 09:40:22 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311496#comment-15311496
 ]


Junping Du edited comment on YARN-5190 at 6/2/16 4:31 PM:
----------------------------------------------------------

Discussed offline with [~jianhe], we think a couple of things need to get fixed 
here :
1. Fix the asymmetric behaviors in register()/unregisterSource() at 
MetricsSystemImpl that source name is still left in {{sourceNames.map}} in 
DefaultMetricsSystem after unregisterSource().

2. ContainerMetrics.finished() could get called twice - one for container life 
cycle (involved in YARN-4906) and the other in container monitoring life cycle. 
Ideally, it is better to make sure ContainerMetrics.finished() for the same 
container only get called one time one place. However, in practice, the 
container event life cycle and container monitor event life cycle are 
independent and cannot replace each other. Alternatively, we will make sure 
scheduleTimerTaskForUnregistration() only get called one time or it will be 
more threads of unregistration than needed.

3. In case one ContainerMetrics already get finished before (triggered as 
ContainerDoneTransition by ContainerKillEvent, ContianerDoneEvent, etc.), 
current logic in 
{{ContainerMonitorImpl.updateContainerMetrics(ContainersMonitorEvent)}} will 
still register metrics into DefaultMetricsSystem first (via 
ContainerMetrics.forContainer(...)) and unregister it from DefaultMetricsSystem 
soon after. This is completely unnecessary.

Will deliver a fix for three issues raised above.


was (Author: djp):
Discussed offline with [~jianhe], we think a couple of things need to get fixed 
here :
1. Fix the asymmetric behaviors in register()/unregisterSource() at 
MetricsSystemImpl that source name is still left in {{sourceNames.map}} in 
DefaultMetricsSystem after unregisterSource().

2. ContainerMetrics.finished() could get called twice - one for container life 
cycle (involved in YARN-4906) and the other in container monitoring life cycle. 
Ideally, it is better to make sure ContainerMetrics.finished() for the same 
container only get called one time one place. However, in practice, the 
container event life cycle and container monitor event life cycle are 
independent and cannot replace each other. Alternatively, we will make sure 
scheduleTimerTaskForUnregistration() only get called one time or it will be 
more threads of unregistration than needed.

3. In case one ContainerMetrics already get finished before (triggered by 
Container life cycle event), current logic in 
{{ContainerMonitorImpl.updateContainerMetrics(ContainersMonitorEvent)}} will 
still register metrics into DefaultMetricsSystem first (via 
ContainerMetrics.forContainer(...)) and unregister it from DefaultMetricsSystem 
soon after. This is completely unnecessary.

Will deliver a fix for three issues raised above.

> Registering/unregistering container metrics triggered by ContainerEvent and 
> ContainersMonitorEvent are conflict which cause uncaught exception in 
> ContainerMonitorImpl
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5190
>                 URL: https://issues.apache.org/jira/browse/YARN-5190
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>         Attachments: YARN-5190.patch
>
>
> The exception stack is as following:
> {noformat}
> 310735 2016-05-22 01:50:04,554 [Container Monitor] ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container 
> Monitor,5,main] threw an Exception.
> 310736 org.apache.hadoop.metrics2.MetricsException: Metrics source 
> ContainerResource_container_1463840817638_14484_01_000010 already exists!
> 310737         at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
> 310738         at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
> 310739         at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
> 310740         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
> 310741         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
> 310742         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
> {noformat}
> After YARN-4906, we have multiple places to get ContainerMetrics for a 
> particular container that could cause race condition in registering the same 
> container metrics to DefaultMetricsSystem by different threads. Lacking of 
> proper handling of MetricsException which could get thrown, the exception 
> will could bring down daemon of ContainerMonitorImpl or even whole NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (YARN-5190) Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent are conflict which cause uncaught exception in ContainerMonitorImpl

Reply via email to