[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867599#comment-16867599 ] K G Bakthavachalam commented on YARN-6695: -- [~Prabhu Joseph] can u provide brief info how to reproduce this issue > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821663#comment-16821663 ] Prabhu Joseph commented on YARN-6695: - Thanks [~eyang]. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821292#comment-16821292 ] Hudson commented on YARN-6695: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16439 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16439/]) YARN-6695. Fixed NPE in publishing appFinished events to ATSv2. (eyang: rev df76cdc8959c51b71704ab5c38335f745a6f35d8) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServiceV2.md * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TimelineServiceV2Publisher.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisherForV2.java > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821278#comment-16821278 ] Eric Yang commented on YARN-6695: - +1 for patch 2. Thank you for the patch, [~Prabhu Joseph]. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820674#comment-16820674 ] Prabhu Joseph commented on YARN-6695: - [~eyang] *yarn.rm.system-metrics-publisher.emit-container-events* is by default false but the Documentation mentions it part of basic configuration to enable TimelineService V2. This config can be enabled only if user requires RM container information and not needed for enabling the Timeline Service. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820406#comment-16820406 ] Eric Yang commented on YARN-6695: - [~Prabhu Joseph] I see. What is the reason to remove the documentation? > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820385#comment-16820385 ] Prabhu Joseph commented on YARN-6695: - [~eyang] The findbugs warnings are not related and will be fixed by YARN-9495. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820375#comment-16820375 ] Eric Yang commented on YARN-6695: - [~Prabhu Joseph] It looks like findbugs detected the if statement is a contradiction to the caller that should never be null in the current location: {code} + if (timelineCollector != null) { +TimelineEntities entities = new TimelineEntities(); +entities.addEntity(entity); +timelineCollector.putEntities(entities, +UserGroupInformation.getCurrentUser()); + } else { +LOG.debug("Cannot find active collector while publishing entity " ++ entity); + } {code} It sounds like the source of NullPointerException may come from UserGroupInformation.getCurrentUser(), where the user token has expired. Can you confirm this? > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820253#comment-16820253 ] Prabhu Joseph commented on YARN-6695: - Failed testcase from TestQueueManagementDynamicEditPolicy is not related and will be handled by YARN-9325. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819981#comment-16819981 ] Hadoop QA commented on YARN-6695: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 50s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 27s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 37s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 2 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 16s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 11m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 84m 33s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 27s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 49s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}175m 30s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueManagementDynamicEditPolicy | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-6695 | | JIRA Patch URL |
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819840#comment-16819840 ] Prabhu Joseph commented on YARN-6695: - [~eyang] Attached 002 patch which handles NullPointerException with debug logging. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695-002.patch, YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819603#comment-16819603 ] Eric Yang commented on YARN-6695: - [~Prabhu Joseph] I think another linger period flag is masking engineering problem but not useful to system admin. Maybe we just log as debug when NullPointerException is encountered. This will reduce the noise in log because we know it's a race condition problem. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Assignee: Prabhu Joseph >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818735#comment-16818735 ] Prabhu Joseph commented on YARN-6695: - [~eyang] Thanks for checking this. Have tried below few ways to change the ordering which improves from consistent failure to intermittent but did not find a better way to ensure collector removal only after all events handled as they are handled asynchronously. 1. Send ATTEMPT_FINISHED to {{RMAppImpl}} after sending APP_ATTEMPT_REMOVED to Scheduler in {{RMAppAttemptImpl}} 2. Stop Timeline Collector as part of doneApplication in Scheduler Another approach is to remove the collectors after a configured collector linger period similar to YARN-3995 done for NM Events. Can you check if this approach is fine. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818353#comment-16818353 ] Eric Yang commented on YARN-6695: - [~Prabhu Joseph] This fix will allow the system to continue, but it is likely to generate more error logs due to the race condition that ATTEMPT_FINISHED event before container is killed. Is there anything that can be done to fix the ordering to make sure that containers are killed and finished before ATTEMPT_FINISHED event and removes the Timeline collector last? > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817876#comment-16817876 ] Prabhu Joseph commented on YARN-6695: - This issue happens consistently with DistributedShell job by failing container launch. AM is finished and it has a container in {{ACQUIRED}} state. When AM Container Finishes and final application attempt is done, {{RMAppAttemptImpl}} sends {{RMAppEvent(ATTEMPT_FINISHED)}} to {{RMAppImpl}} and sends {{AppAttemptRemovedSchedulerEvent}} to {{CapacityScheduler}}. {{CapacityScheduler}} kills the live containers. By the time, {{CapacityScheduler}} sends Kill event and then {{RMContainerImpl}} handles Kill, {{RMAppImpl}} already handles {{ATTEMPT_FINISHED}} and removes the Timeline Collector. {code} RMAppAttemptImpl -> ATTEMPT_FINISHED -> RMAppImpl#applicationFinished() RMAppAttemptImpl -> APP_ATTEMPT_REMOVED -> CapacityScheduler#doneApplicationAttempt() -> Kill -> RMContainerImpl#containerFinished() {code} > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817641#comment-16817641 ] Hadoop QA commented on YARN-6695: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 4s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 16s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 2 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 50s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 81m 14s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}130m 54s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-6695 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12954060/YARN-6695.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 3b1912272248 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / c4c16ca | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/23953/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23953/testReport/ | | Max. process+thread count | 919 (vs. ulimit of 1) | |
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817602#comment-16817602 ] Prabhu Joseph commented on YARN-6695: - [~rohithsharma] [~eyang] If you are okay, i will work on this. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749560#comment-16749560 ] Rohith Sharma K S commented on YARN-6695: - Yes this is corner case and need to handle it. We can fix this issue as I have already reviewed patch and given comments on current patch. Secondly, it is better to update the documentation by removing yarn.rm.system-metrics-publisher.emit-container-events configuration in *Enabling Timeline Service v.2* > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749506#comment-16749506 ] Sunil Govindan commented on YARN-6695: -- [~rohithsharma] I think we need to check on documentation as mentioned by [~yuan_zac]. We should ideally cover corner cases when this flag is made true and which should not cause any failures in RM. Thoughts ? > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749497#comment-16749497 ] Zac Zhou commented on YARN-6695: I checked the logic for TimelineServiceV2Publisher. If yarn.rm.system-metrics-publisher.emit-container-events is set to true, RM only publishes two events, containerCreated and containerFinished, to timeline server v2. I think the workload is affordable. And In the guide of TimelineServiceV2, yarn.rm.system-metrics-publisher.emit-container-events is set to true as a basic configuration to enable timeline v2. This Jira is useful. I suggest to commit it to trunk recently~ > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736803#comment-16736803 ] Wangda Tan commented on YARN-6695: -- Thanks [~eyang]/[~rohithsharma], I'm going to update target version to next release and unblock 3.1.2 and 3.2.0. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736729#comment-16736729 ] Eric Yang commented on YARN-6695: - [~rohithsharma] I copied the configuration from Ambari deployed cluster. You are correct that yarn.rm.system-metrics-publisher.emit-container-events is set to true in my cluster. Given that this would be encountered only if user explicit set config to true. I don't think this is a 3.2.0 or 3.1.2 release blocker, but good to fix this eventually. Stack trace is in description of YARN-7539. I think UserGroupInformation.getCurrentUser() returned null when YARN service does not have a final state, and application is removed from RM memory, and unable to get the owner of the app. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736704#comment-16736704 ] Rohith Sharma K S commented on YARN-6695: - [~eyang] Publishing container events from RM is disabled by default i.e *yarn.rm.system-metrics-publisher.emit-container-events* is set to *false*. Are you enabled this configuration? And we don't recommend to enable this configuration since it overloads RM with lot of events. If you can attach stack trace would be help full. Reg the patch, I am not a fan of catching NPE! Instead lets do explicit null check and log with right message something similar to NMTimelinePublisher#putEntity. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736545#comment-16736545 ] Hadoop QA commented on YARN-6695: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 36s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 44s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 16s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}151m 42s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | | | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-6695 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12954060/YARN-6695.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 8b26a6173761 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0f26b5e | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/23012/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23012/testReport/
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736436#comment-16736436 ] Eric Yang commented on YARN-6695: - YARN service encounter this bug frequently because the application does not have a completed state. The force removal of YARN service will prevent Timeline service from publishing the finishing event. Patch 001 is a workaround to make sure that Resource Manager doesn't crash when YARN app is destroyed. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Critical > Attachments: YARN-6695.001.patch > > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734587#comment-16734587 ] Eric Yang commented on YARN-6695: - [~rohithsharma] Raising priority for this issue. When destroying a YARN service, this issue occurs every time and leading to resource manager crash. > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571174#comment-16571174 ] Akhil PB commented on YARN-6695: NPE has thrown when tried to stop a service, which resulted in RM shutdown. [~rohithsharma] [~sunilg] [~vrushalic] {code:java} 2018-08-07 11:36:02,774 ERROR org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher: Error when publishing entity TimelineEntity[type='YARN_APPLICATION', id='application_1533536393859_0003'] 2018-08-07 11:36:02,833 INFO org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager: The collector service for application_1533536393859_0003 was removed 2018-08-07 11:36:02,858 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:459) at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73) at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:494) at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:483) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) at java.lang.Thread.run(Thread.java:748){code} > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE
[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436049#comment-16436049 ] Vrushali C commented on YARN-6695: -- Yes, similar situation in YARN-8130. We should fix these race condition jiras similar to fixes in YARN-3995 and YARN-7835. That will also help with stack traces noted in YARN-8155 > Race condition in RM for publishing container events vs appFinished events > causes NPE > -- > > Key: YARN-6695 > URL: https://issues.apache.org/jira/browse/YARN-6695 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Major > > When RM publishes container events i.e by enabling > *yarn.rm.system-metrics-publisher.emit-container-events*, there is race > condition for processing events > vs appFinished event that removes appId from collector list which cause NPE. > Look at the below trace where appId is removed from collectors first and then > corresponding events are processed. > {noformat} > 2017-06-06 19:28:48,896 INFO capacity.ParentQueue > (ParentQueue.java:removeApplication(472)) - Application removed - appId: > application_1496758895643_0005 user: root leaf-queue of parent: root > #applications: 0 > 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(190)) - The collector service for > application_1496758895643_0005 was removed > 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher > (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing > entity TimelineEntity[type='YARN_CONTAINER', > id='container_e01_1496758895643_0005_01_02'] > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480) > at > org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org