[
https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205510#comment-15205510
]
Sangjin Lee commented on YARN-4711:
-----------------------------------
Thanks for the proposed patch [~Naganarasimha]! I am going over it.
I did want to discuss one high level observation. It seems that you're taking
an approach of invoking the {{TimelineClient}} directly for async writes while
still using the dispatcher for sync writes. I understand that it is
functionally correct, and incidentally it also may solve one of the NPEs. On
the other hand, one downside is that we would have two very distinct sets of
code to write within {{NMTimelinePublisher}}, one for async writes and another
for sync writes. I'm still thinking about that, and I'm not sure whether it is
ideal or not.
If we had a way to address the NPE issue but stick with the current style
(using the dispatcher both for sync and async writes), it would lead to simpler
code that's easier to maintain, right? What is your thought on this? Pros and
cons?
> NM is going down with NPE's due to single thread processing of events by
> Timeline client
> ----------------------------------------------------------------------------------------
>
> Key: YARN-4711
> URL: https://issues.apache.org/jira/browse/YARN-4711
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Priority: Critical
> Labels: yarn-2928-1st-milestone
> Attachments: 4711Analysis.txt, YARN-4711-YARN-2928.v1.001.patch
>
>
> After YARN-3367, while testing the latest 2928 branch came across few NPEs
> due to which NM is shutting down.
> {code}
> 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher:
> Error in dispatcher thread
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306)
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213)
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192)
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63)
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289)
> at
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> On analysis found that the there was delay in processing of events, as after
> YARN-3367 all the events were getting processed by a single thread inside the
> timeline client.
> Additionally found one scenario where there is possibility of NPE:
> * TimelineEntity.toString() when {{real}} is not null
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)