[ 
https://issues.apache.org/jira/browse/YARN-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418461#comment-16418461
 ] 

Tao Yang commented on YARN-8085:
--------------------------------

Thanks [~cheersyang] for your suggestion.

Yes, RMServiceContext holds the services that keep running regardless of the 
RM's HA state, so it is better to move ResourceProfilesManager into 
RMServiceContext.
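
To illustrate the reasoning, here is a minimal, self-contained sketch of the pattern. It is not the Hadoop code: the classes `ProfilesManager`, `ServiceContext` and `Context` and the method `reset` are hypothetical stand-ins for ResourceProfilesManager, RMServiceContext, RMContextImpl and the reset step. It assumes only what is described above: the reset builds a new context and carries the service context over as a whole.

```java
public class FailoverContextSketch {

    /** Hypothetical stand-in for ResourceProfilesManager. */
    static class ProfilesManager {
    }

    /** Hypothetical stand-in for RMServiceContext: state that outlives HA transitions. */
    static class ServiceContext {
        ProfilesManager profilesManager;
    }

    /** Hypothetical stand-in for RMContextImpl: rebuilt on every transition to active. */
    static class Context {
        ServiceContext serviceContext = new ServiceContext();
        // Held directly on the context, so it is lost unless the reset copies it.
        ProfilesManager directProfilesManager;
    }

    /** Mimics the reset step: new context, only the service context is transferred. */
    static Context reset(Context old) {
        Context fresh = new Context();
        fresh.serviceContext = old.serviceContext;
        return fresh;
    }

    public static void main(String[] args) {
        Context active = new Context();
        active.directProfilesManager = new ProfilesManager();
        active.serviceContext.profilesManager = new ProfilesManager();

        // Simulate standby -> active: the context is rebuilt.
        Context afterFailover = reset(active);

        // prints "direct field survives: false"
        System.out.println("direct field survives: "
            + (afterFailover.directProfilesManager != null));
        // prints "service-context field survives: true"
        System.out.println("service-context field survives: "
            + (afterFailover.serviceContext.profilesManager != null));
    }
}
```

Anything stored in the service context is carried over without a per-field transfer in the reset step, which is why this placement avoids having to remember a separate setter call for each long-lived manager.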

Attached v2 patch for review.

> RMContext#resourceProfilesManager is lost after RM went standby then back to 
> active
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-8085
>                 URL: https://issues.apache.org/jira/browse/YARN-8085
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.1.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-8085.001.patch, YARN-8085.002.patch
>
>
> We submitted a distributed shell application after the RM failed over and 
> came back to active, then got an NPE in the RM log:
> {noformat}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getResourceProfiles(ClientRMService.java:1814)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getResourceProfiles(ApplicationClientProtocolPBServiceImpl.java:657)
>         at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:617)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {noformat}
> The cause is that resourceProfilesManager is currently not transferred to the 
> new RMContext instance in ResourceManager#resetRMContext. We should add this 
> transfer to fix the error.
> {code:java}
> @@ -1488,6 +1488,10 @@ private void resetRMContext() {
>      // transfer service context to new RM service Context
>      rmContextImpl.setServiceContext(rmContext.getServiceContext());
> +    // transfer resource profiles manager
> +    rmContextImpl
> +        .setResourceProfilesManager(rmContext.getResourceProfilesManager());
> +
>      // reset dispatcher
>      Dispatcher dispatcher = setupDispatcher();
>      ((Service) dispatcher).init(this.conf);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
