[ https://issues.apache.org/jira/browse/YARN-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044871#comment-15044871 ]
gu-chi commented on YARN-4427:
------------------------------

NM recovery is enabled; this is the precondition.

> NPE on handleNMContainerStatus when NM is registering to RM
> -----------------------------------------------------------
>
>                 Key: YARN-4427
>                 URL: https://issues.apache.org/jira/browse/YARN-4427
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Critical
>
> *Seen the following in one of our environments when the AM got allocated a container but this failed to be updated in ZK, where the cluster was having network problems for some time (up and down).*
> {noformat}
> 2015-12-07 16:39:38,489 | WARN | IPC Server handler 49 on 26003 | IPC Server handler 49 on 26003, call org.apache.hadoop.yarn.server.api.ResourceTrackerPB.registerNodeManager from 9.91.8.220:52169 Call#17 Retry#0 | Server.java:2107
> java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.handleNMContainerStatus(ResourceTrackerService.java:286)
>       at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:395)
>       at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.registerNodeManager(ResourceTrackerPBServiceImpl.java:54)
>       at org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:79)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1673)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
> {noformat}
> Corresponding code; it might not match {{branch-2.7/Trunk}}, since we have modified it internally.
> {code}
> 284 RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptId);
> 285 Container masterContainer = rmAppAttempt.getMasterContainer();
> 286 if (masterContainer.getId().equals(containerStatus.getContainerId())
> 287     && containerStatus.getContainerState() == ContainerState.COMPLETE) {
> 288   ContainerStatus status =
> 289       ContainerStatus.newInstance(containerStatus.getContainerId(),
> 290           containerStatus.getContainerState(), containerStatus.getDiagnostics(),
> 291           containerStatus.getContainerExitStatus());
> 292   // sending master container finished event.
> 293   RMAppAttemptContainerFinishedEvent evt =
> 294       new RMAppAttemptContainerFinishedEvent(appAttemptId, status,
> 295           nodeId);
> 296   rmContext.getDispatcher().getEventHandler().handle(evt);
> 297 }
> {code}
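
The trace points at line 286 of the quoted snippet, where {{masterContainer.getId()}} is the first dereference. One plausible reading is that {{getMasterContainer()}} returned null because the allocation was never persisted. The sketch below shows a defensive null check for that case. It is only an illustration under that assumption, not the committed fix: {{MasterContainerGuard}}, {{Container}}, {{AppAttempt}} and {{isFinishedMasterContainer}} are hypothetical stand-ins, not the real YARN classes or methods.

```java
import java.util.Objects;

// Hypothetical stand-ins for the YARN types in the quoted snippet; not the real classes.
final class MasterContainerGuard {

    /** Minimal stand-in for a container that only carries an id. */
    static final class Container {
        private final String id;

        Container(String id) {
            this.id = Objects.requireNonNull(id);
        }

        String getId() {
            return id;
        }
    }

    /** Minimal stand-in for an app attempt whose master container may never have been recorded. */
    static final class AppAttempt {
        private final Container masterContainer; // null when no master container was recorded

        AppAttempt(Container masterContainer) {
            this.masterContainer = masterContainer;
        }

        Container getMasterContainer() {
            return masterContainer;
        }
    }

    /**
     * Returns true only when the reported container is the attempt's master
     * container and it has completed. A missing attempt or a missing master
     * container yields false instead of a NullPointerException.
     */
    static boolean isFinishedMasterContainer(AppAttempt attempt,
                                             String reportedContainerId,
                                             boolean reportedComplete) {
        if (attempt == null) {
            return false;
        }
        Container master = attempt.getMasterContainer();
        if (master == null) {
            return false;
        }
        return master.getId().equals(reportedContainerId) && reportedComplete;
    }

    public static void main(String[] args) {
        AppAttempt withoutMaster = new AppAttempt(null);
        AppAttempt withMaster = new AppAttempt(new Container("container_1"));

        // prints "false": no master container recorded, so nothing to report
        System.out.println(isFinishedMasterContainer(withoutMaster, "container_1", true));
        // prints "true": ids match and the container is complete
        System.out.println(isFinishedMasterContainer(withMaster, "container_1", true));
    }
}
```

Whether silently skipping the container is the right behaviour for the RM in this situation, as opposed to logging it or failing the attempt, is a separate design question that this sketch does not settle.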