[jira] [Comment Edited] (YARN-6847) NPE in RM while starting timeline collector on recovery after explicit failover

Varun Saxena (JIRA) Wed, 19 Jul 2017 15:26:24 -0700

    [ https://issues.apache.org/jira/browse/YARN-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093898#comment-16093898 ]


Varun Saxena edited comment on YARN-6847 at 7/19/17 10:25 PM:
--------------------------------------------------------------

Found this issue while writing tests for YARN-6130.
The NPE occurs because the RMTimelineCollectorManager object in RMContext is null.

The RMTimelineCollectorManager instance is stored in the active service 
context inside RMContextImpl, but the object itself is created inside 
ResourceManager#serviceInit. As a result, if the RM is made to transition to 
standby, the active service context is reset (created again) and the 
RMTimelineCollectorManager object is never set in the new context.

When the RM subsequently becomes active and, during recovery, a timeline 
collector has to be started for a recovered app, the start fails with an NPE.
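
The lifecycle described above can be reproduced in a minimal sketch. Everything in it is hypothetical: CollectorManager, ActiveServiceContext and Rm are simplified stand-ins for RMTimelineCollectorManager, the active service context in RMContextImpl and ResourceManager, not the actual Hadoop classes or their signatures. It only shows why an object set once at init is lost when the context holding it is recreated.

```java
// Hypothetical sketch: none of these classes are the real Hadoop types.
class ActiveContextResetSketch {

    /** Stand-in for RMTimelineCollectorManager. */
    static class CollectorManager {
        String startCollector(String appId) {
            return "collector started for " + appId;
        }
    }

    /** Stand-in for the active service context inside RMContextImpl. */
    static class ActiveServiceContext {
        CollectorManager collectorManager; // stays null until someone sets it
    }

    /** Stand-in for the ResourceManager together with its context. */
    static class Rm {
        private ActiveServiceContext active = new ActiveServiceContext();

        // Runs once: the manager is created and stored in the current active context.
        void serviceInit() {
            active.collectorManager = new CollectorManager();
        }

        // Going to standby replaces the active context; the manager is not carried over.
        void transitionToStandby() {
            active = new ActiveServiceContext();
        }

        // Recovery on becoming active starts a collector for a recovered app.
        String transitionToActive(String recoveredAppId) {
            return active.collectorManager.startCollector(recoveredAppId);
        }
    }

    /** Returns true if recovery after a standby transition hits an NPE. */
    static boolean npeAfterFailover() {
        Rm rm = new Rm();
        rm.serviceInit();
        rm.transitionToStandby();
        try {
            rm.transitionToActive("app_1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE after failover: " + npeAfterFailover());
    }
}
```

In this sketch, calling transitionToActive directly after serviceInit works, while the same call after transitionToStandby throws, which matches the sequence in the stack trace (explicit failover, then recovery).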


was (Author: varun_saxena):
Found this issue while writing tests for YARN-6130.
NPE is because RMTimelineCollectorManager in RMContext being null.

This is because RMTimelineCollectorManager instance is set in active service 
context inside RMContextImpl but the object for it is created inside 
ResourceManager#serviceInit. This means if RM is made to transition to standby, 
active service context will be reset(created again) and 
RMTimelineCollectorManager object will never be set in it.

This means that when RM subsequently becomes active, during recovery if a 
timeline collector for a recovered app is to be started, that would fail due to 
a NPE.

> NPE in RM while starting timeline collector on recovery after explicit 
> failover
> -------------------------------------------------------------------------------
>
>                 Key: YARN-6847
>                 URL: https://issues.apache.org/jira/browse/YARN-6847
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Varun Saxena
>
> {noformat}
> 2017-07-20 03:20:50,742 ERROR [Thread-449] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(763)) - Failed to load/recover state
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.startTimelineCollector(RMAppImpl.java:535)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:467)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:336)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:576)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1419)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1178)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1218)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1214)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1214)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:319)
>         at org.apache.hadoop.yarn.client.ProtocolHATestBase.explicitFailover(ProtocolHATestBase.java:205)
>         at org.apache.hadoop.yarn.client.ProtocolHATestBase$1.run(ProtocolHATestBase.java:250)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
