[ 
https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-4711:
------------------------------------
    Attachment: 4711Analysis.txt

Hi [~sjlee0], 
Sorry for the long delay ! 
>From the analysis i am able to identify that it happens on any exception from 
>the web server while publishing the entity. Earlier i suspected it might be 
>due to the time taken to publish but other imp cause could be we are retrying 
>to publish the entity irrespective of the exception type.
So basically 
{{ContainerManagerImpl.ContainerEventDispatcher.handle(ContainerEvent)}} -> 
{{nmMetricsPublisher.publishContainerEvent}} -> 
{{NMTimelinePublisher.ContainerEventHandler.handle(ContainerEvent)}}. So 
synchronous container metric event dispatching in 
{{NMTimelinePublisher.dispatcher}} is getting slowed down as 
{{TimelineClientImpl.putObjects}} is retrying on exception.
  


> NM is going down with NPE's due to single thread processing of events by 
> Timeline client
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4711
>                 URL: https://issues.apache.org/jira/browse/YARN-4711
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: 4711Analysis.txt
>
>
> After YARN-3367, while testing the latest 2928 branch came across few NPEs 
> due to which NM is shutting down.
> {code}
> 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> On analysis found that the there was delay in processing of events, as after 
> YARN-3367 all the events were getting processed by a single thread inside the 
> timeline client. 
> Additionally found one scenario where there is possibility of NPE:
> * TimelineEntity.toString() when {{real}} is not null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to