[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-06-19 Thread K G Bakthavachalam (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867599#comment-16867599
 ] 

K G Bakthavachalam commented on YARN-6695:
--

[~Prabhu Joseph] can u provide brief info how to reproduce this issue

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-18 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821663#comment-16821663
 ] 

Prabhu Joseph commented on YARN-6695:
-

Thanks [~eyang].

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821292#comment-16821292
 ] 

Hudson commented on YARN-6695:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16439 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16439/])
YARN-6695. Fixed NPE in publishing appFinished events to ATSv2.  
(eyang: rev df76cdc8959c51b71704ab5c38335f745a6f35d8)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServiceV2.md
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TimelineServiceV2Publisher.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisherForV2.java


> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-18 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821278#comment-16821278
 ] 

Eric Yang commented on YARN-6695:
-

+1 for patch 2.  Thank you for the patch, [~Prabhu Joseph].

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820674#comment-16820674
 ] 

Prabhu Joseph commented on YARN-6695:
-

[~eyang] *yarn.rm.system-metrics-publisher.emit-container-events* is by default 
false but the Documentation mentions it part of basic configuration to enable 
TimelineService V2. This config can be enabled only if user requires RM 
container information and not needed for enabling the Timeline Service. 

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820406#comment-16820406
 ] 

Eric Yang commented on YARN-6695:
-

[~Prabhu Joseph] I see.  What is the reason to remove the documentation?

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820385#comment-16820385
 ] 

Prabhu Joseph commented on YARN-6695:
-

[~eyang] The findbugs warnings are not related and will be fixed by YARN-9495.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820375#comment-16820375
 ] 

Eric Yang commented on YARN-6695:
-

[~Prabhu Joseph] It looks like findbugs detected the if statement is a 
contradiction to the caller that should never be null in the current location:

{code}
+  if (timelineCollector != null) {
+TimelineEntities entities = new TimelineEntities();
+entities.addEntity(entity);
+timelineCollector.putEntities(entities,
+UserGroupInformation.getCurrentUser());
+  } else {
+LOG.debug("Cannot find active collector while publishing entity "
++ entity);
+  }
{code}

It sounds like the source of NullPointerException may come from 
UserGroupInformation.getCurrentUser(), where the user token has expired.  Can 
you confirm this?

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820253#comment-16820253
 ] 

Prabhu Joseph commented on YARN-6695:
-

Failed testcase from TestQueueManagementDynamicEditPolicy is not related and 
will be handled by YARN-9325.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819981#comment-16819981
 ] 

Hadoop QA commented on YARN-6695:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
50s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 27s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
37s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in trunk has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
11s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 11m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 11s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 84m 33s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
27s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
49s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}175m 30s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueManagementDynamicEditPolicy
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-6695 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-17 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819840#comment-16819840
 ] 

Prabhu Joseph commented on YARN-6695:
-

[~eyang] Attached 002 patch which handles NullPointerException with debug 
logging.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695-002.patch, YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819603#comment-16819603
 ] 

Eric Yang commented on YARN-6695:
-

[~Prabhu Joseph] I think another linger period flag is masking engineering 
problem but not useful to system admin.  Maybe we just log as debug when 
NullPointerException is encountered.  This will reduce the noise in log because 
we know it's a race condition problem.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-16 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818735#comment-16818735
 ] 

Prabhu Joseph commented on YARN-6695:
-

[~eyang] Thanks for checking this. Have tried below few ways to change the 
ordering which improves from consistent failure to intermittent but did not 
find a better way to ensure collector removal only after all events handled as 
they are handled asynchronously.

1. Send ATTEMPT_FINISHED to {{RMAppImpl}} after sending APP_ATTEMPT_REMOVED to 
Scheduler in {{RMAppAttemptImpl}} 
2. Stop Timeline Collector as part of doneApplication in Scheduler

Another approach is to remove the collectors after a configured collector 
linger period similar to YARN-3995 done for NM Events. Can you check if this 
approach is fine.








> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-15 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818353#comment-16818353
 ] 

Eric Yang commented on YARN-6695:
-

[~Prabhu Joseph] This fix will allow the system to continue, but it is likely 
to generate more error logs due to the race condition that ATTEMPT_FINISHED 
event before container is killed.  Is there anything that can be done to fix 
the ordering to make sure that containers are killed and finished before 
ATTEMPT_FINISHED event and removes the Timeline collector last?

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-15 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817876#comment-16817876
 ] 

Prabhu Joseph commented on YARN-6695:
-

This issue happens consistently with DistributedShell job by failing container 
launch. AM is finished and it has a container in {{ACQUIRED}} state.

When AM Container Finishes and final application attempt is done, 
{{RMAppAttemptImpl}} sends {{RMAppEvent(ATTEMPT_FINISHED)}}  to {{RMAppImpl}} 
and sends {{AppAttemptRemovedSchedulerEvent}} to {{CapacityScheduler}}.  

{{CapacityScheduler}} kills the live containers. By the time, 
{{CapacityScheduler}} sends Kill event and then {{RMContainerImpl}} handles 
Kill, {{RMAppImpl}} already handles {{ATTEMPT_FINISHED}} and removes the 
Timeline Collector.

{code}
RMAppAttemptImpl -> ATTEMPT_FINISHED -> RMAppImpl#applicationFinished()

RMAppAttemptImpl -> APP_ATTEMPT_REMOVED -> 
CapacityScheduler#doneApplicationAttempt() -> Kill -> 
RMContainerImpl#containerFinished()
{code}








> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-15 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817641#comment-16817641
 ] 

Hadoop QA commented on YARN-6695:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  4s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
16s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in trunk has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 81m 
14s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}130m 54s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-6695 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12954060/YARN-6695.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 3b1912272248 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c4c16ca |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-YARN-Build/23953/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23953/testReport/ |
| Max. process+thread count | 919 (vs. ulimit of 1) |
| 

[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-04-15 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817602#comment-16817602
 ] 

Prabhu Joseph commented on YARN-6695:
-

[~rohithsharma] [~eyang] If you are okay, i will work on this.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-22 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749560#comment-16749560
 ] 

Rohith Sharma K S commented on YARN-6695:
-

Yes this is corner case and need to handle it. We can fix this issue as I have 
already reviewed patch and given comments on current patch. Secondly, it is 
better to update the documentation by removing 
yarn.rm.system-metrics-publisher.emit-container-events configuration in 
*Enabling Timeline Service v.2*

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-22 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749506#comment-16749506
 ] 

Sunil Govindan commented on YARN-6695:
--

[~rohithsharma]

I think we need to check on documentation as mentioned by [~yuan_zac]. We 
should ideally cover corner cases when this flag is made true and which should 
not cause any failures in RM. Thoughts ?

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-22 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749497#comment-16749497
 ] 

Zac Zhou commented on YARN-6695:


I checked the logic for TimelineServiceV2Publisher. If 
yarn.rm.system-metrics-publisher.emit-container-events is set to true, RM only 
publishes two events, containerCreated and containerFinished, to timeline 
server v2. I think the workload is affordable.  

And In the guide of TimelineServiceV2,  
yarn.rm.system-metrics-publisher.emit-container-events is set to true as a 
basic configuration to enable timeline v2.

This Jira is useful. I suggest to commit it to trunk recently~

 

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-07 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736803#comment-16736803
 ] 

Wangda Tan commented on YARN-6695:
--

Thanks [~eyang]/[~rohithsharma], I'm going to update target version to next 
release and unblock 3.1.2 and 3.2.0.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-07 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736729#comment-16736729
 ] 

Eric Yang commented on YARN-6695:
-

[~rohithsharma] I copied the configuration from Ambari deployed cluster.  You 
are correct that yarn.rm.system-metrics-publisher.emit-container-events is set 
to true in my cluster.  Given that this would be encountered only if user 
explicit set config to true.  I don't think this is a 3.2.0 or 3.1.2 release 
blocker, but good to fix this eventually.

Stack trace is in description of YARN-7539.  I think 
UserGroupInformation.getCurrentUser() returned null when YARN service does not 
have a final state, and application is removed from RM memory, and unable to 
get the owner of the app.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-07 Thread Rohith Sharma K S (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736704#comment-16736704
 ] 

Rohith Sharma K S commented on YARN-6695:
-

[~eyang] Publishing container events from RM is disabled by default i.e 
*yarn.rm.system-metrics-publisher.emit-container-events* is set to *false*. Are 
you enabled this configuration? And we don't recommend to enable this 
configuration since it overloads RM with lot of events. If you can attach stack 
trace would be help full. 

Reg the patch, I am not a fan of catching NPE! Instead lets do explicit null 
check and log with right message something similar to 
NMTimelinePublisher#putEntity. 

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736545#comment-16736545
 ] 

Hadoop QA commented on YARN-6695:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 36s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 44s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 16s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}151m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.security.TestAMRMTokens |
|   | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy 
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-6695 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12954060/YARN-6695.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8b26a6173761 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0f26b5e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/23012/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23012/testReport/ 

[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-07 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736436#comment-16736436
 ] 

Eric Yang commented on YARN-6695:
-

YARN service encounter this bug frequently because the application does not 
have a completed state.  The force removal of YARN service will prevent 
Timeline service from publishing the finishing event.  Patch 001 is a 
workaround to make sure that Resource Manager doesn't crash when YARN app is 
destroyed.

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-04 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734587#comment-16734587
 ] 

Eric Yang commented on YARN-6695:
-

[~rohithsharma] Raising priority for this issue.  When destroying a YARN 
service, this issue occurs every time and leading to resource manager crash.  

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2018-08-07 Thread Akhil PB (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571174#comment-16571174
 ] 

Akhil PB commented on YARN-6695:


NPE has thrown when tried to stop a service, which resulted in RM shutdown.

[~rohithsharma] [~sunilg] [~vrushalic]
{code:java}
2018-08-07 11:36:02,774 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher:
 Error when publishing entity TimelineEntity[type='YARN_APPLICATION', 
id='application_1533536393859_0003']
2018-08-07 11:36:02,833 INFO 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
 The collector service for application_1533536393859_0003 was removed
2018-08-07 11:36:02,858 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:459)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:494)
at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:483)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748){code}

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2018-04-12 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436049#comment-16436049
 ] 

Vrushali C commented on YARN-6695:
--

Yes, similar situation in YARN-8130. We should fix these race condition jiras 
similar to fixes in YARN-3995 and YARN-7835. 

That will also help with stack traces noted in YARN-8155

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> When RM publishes container events i.e by enabling 
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is race 
> condition for processing events 
> vs appFinished event that removes appId from collector list which cause NPE. 
> Look at the below trace where appId is removed from collectors first and then 
> corresponding events are processed. 
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org