[ 
https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197864#comment-15197864
 ] 

Naganarasimha G R commented on YARN-4711:
-----------------------------------------

Offline comments from [~sjlee0]:
{quote}
 The current state of NM timeline integration seems to have quite a few rough 
edges. I did look at the exceptions and here are my early thoughts. I agree 
with your other points.

(1) NPE in {{NMTimelinePublisher$ContainerEventHandler}}
I understand that this happens because the event is handled after the container 
object was removed in the NM context, correct? As a rule, I think any attempt 
to retrieve objects from the NM context in the async event handler is 
inherently dangerous because there is no guarantee that those objects are still 
there in the context. So we should review the {{NMTimelinePublisher}} code to 
spot those cases. This is one of them.

What this event handler needs is the container's resource and priority. What I 
would suggest is to add the resource and priority into the event itself. I'm 
not sure if we need to subclass {{ContainerEvent}} for this purpose... Thoughts?

(2) NPE in {{NMTimelinePublisher.putEntity()}}
This is the other place in {{NMTimelinePublisher}} where it attempts to 
retrieve an object from the context, and it fails for a similar reason. My 
question when I looked at this is, who should own {{TimelineClient}}s? 
Currently they are owned by the individual {{ApplicationImpl}} instances. I'm 
not sure if we went back and forth on this, but if {{ApplicationImpl}} goes 
away but we still need to publish, there doesn't seem to be a way. Since it's 
really {{NMTimelinePublisher}} that needs the timeline clients, should they be 
owned and managed by {{NMTimelinePublisher}}? I know it might be a rather big 
change, but I'm not sure if there is any other way to resolve this.
{quote}
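
To make suggestion (1) concrete, here is a minimal sketch of an event that carries the container's resource and priority itself, so the async handler never goes back to the NM context. Every class and method name below ({{ContainerInfo}}, {{ContainerFinishedEvent}}, {{handleAfterRemoval}}) is hypothetical and stands in for the real NM types. This is not the actual {{ContainerEvent}} API, only an illustration of the idea under that assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class EventSnapshotSketch {

    /** Hypothetical stand-in for the container object kept in the NM context. */
    static final class ContainerInfo {
        final String id;
        final long memoryMb;
        final int vcores;
        final int priority;

        ContainerInfo(String id, long memoryMb, int vcores, int priority) {
            this.id = id;
            this.memoryMb = memoryMb;
            this.vcores = vcores;
            this.priority = priority;
        }
    }

    /** Hypothetical event that copies what the handler needs when it is created. */
    static final class ContainerFinishedEvent {
        final String containerId;
        final long memoryMb;
        final int vcores;
        final int priority;

        ContainerFinishedEvent(ContainerInfo c) {
            this.containerId = c.id;
            this.memoryMb = c.memoryMb;
            this.vcores = c.vcores;
            this.priority = c.priority;
        }
    }

    /** The handler reads only the event, never the shared context. */
    static String handle(ContainerFinishedEvent e) {
        return e.containerId + " memoryMb=" + e.memoryMb
            + " vcores=" + e.vcores + " priority=" + e.priority;
    }

    /** Simulates the race: the container leaves the context before the event is handled. */
    static String handleAfterRemoval() {
        Map<String, ContainerInfo> context = new ConcurrentHashMap<>();
        ContainerInfo c = new ContainerInfo("container_1", 2048, 2, 5);
        context.put(c.id, c);

        // Snapshot taken while the container is still in the context.
        ContainerFinishedEvent event = new ContainerFinishedEvent(c);

        // The container is removed before the async handler runs.
        context.remove(c.id);

        // A context lookup would return null here; the event still has the data.
        return handle(event);
    }

    public static void main(String[] args) {
        System.out.println(handleAfterRemoval());
    }
}
```

Whether this is done by subclassing the existing event or by adding fields to it is a separate design question; the sketch only shows that the handler no longer depends on the lifetime of the context entry.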

> NM is going down with NPE's due to single thread processing of events by 
> Timeline client
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4711
>                 URL: https://issues.apache.org/jira/browse/YARN-4711
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: 4711Analysis.txt
>
>
> After YARN-3367, while testing the latest 2928 branch, I came across a few NPEs 
> that cause the NM to shut down.
> {code}
> 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> On analysis, found that there was a delay in processing events because, after 
> YARN-3367, all events are processed by a single thread inside the 
> timeline client. 
> Additionally, found one scenario where an NPE is possible:
> * TimelineEntity.toString() when {{real}} is not null
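
The failure mode described above (a backlog on a single processing thread, followed by a lookup that finds nothing) can be reproduced in isolation. The sketch below is only an illustration under simplifying assumptions: it uses a plain {{ExecutorService}} and a {{ConcurrentHashMap}} as hypothetical stand-ins for the dispatcher thread and the NM context, not the actual YARN classes.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class DelayedDispatchSketch {

    /** Returns true if the delayed handler still finds the container in the context. */
    static boolean lookupSucceeded() {
        // Hypothetical stand-in for the NM context.
        Map<String, String> context = new ConcurrentHashMap<>();
        context.put("container_1", "resource=2048MB");

        // A single thread handles every event, so one slow event delays the rest.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        CountDownLatch backlog = new CountDownLatch(1);
        try {
            // Earlier event that keeps the only thread busy.
            Callable<Void> slowEvent = () -> {
                backlog.await();
                return null;
            };
            // Later event whose handler looks the container up in the context.
            Callable<String> lookupEvent = () -> context.get("container_1");

            dispatcher.submit(slowEvent);
            Future<String> lookup = dispatcher.submit(lookupEvent);

            // The container finishes and is removed while the event is still queued.
            context.remove("container_1");
            backlog.countDown();

            // The handler runs only now, after the removal, and sees null.
            return lookup.get(5, TimeUnit.SECONDS) != null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            dispatcher.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("lookup succeeded: " + lookupSucceeded());
    }
}
```

In this sketch the handler gets {{null}} back from the lookup; dereferencing that value without a check is what would surface as an NPE on the dispatcher thread.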



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
