[ 
https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163294#comment-15163294
 ] 

Kuhu Shukla commented on YARN-4723:
-----------------------------------

The primary reason for this failure is the {{UnknownNodeId}} object. Even if we 
do not put this dummy nodeId in the active RMNodes, and instead put it in 
inactiveRMNodes, the transition from NEW to DECOMMISSIONED marks the node 
unusable (NODE_UNUSABLE), which triggers a NODE_UPDATE that in turn populates 
the {{updatedNodes}} in the AllocateResponse.
{code}
  @Override
  public void handle(NodesListManagerEvent event) {
    RMNode eventNode = event.getNode();
    switch (event.getType()) {
    case NODE_UNUSABLE:
      LOG.debug(eventNode + " reported unusable");
      unusableRMNodesConcurrentSet.add(eventNode);
      for(RMApp app: rmContext.getRMApps().values()) {
        if (!app.isAppFinalStateStored()) {
          this.rmContext
              .getDispatcher()
              .getEventHandler()
              .handle(
                  new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                      RMAppNodeUpdateType.NODE_UNUSABLE));
        }
      }
{code}

That being said, we should not add the node to the active list. The way to 
solve this problem is to get rid of UnknownNodeId and use anonymous classes 
to initialize these dummy nodes.
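
The failure mode can be sketched with stand-in classes. Everything below is 
hypothetical and only mirrors the shape of the problem: {{Id}} plays the role 
of NodeId, {{PbId}} the role of NodeIdPBImpl, and {{UnknownId}} the role of 
UnknownNodeId. It assumes, as the stack trace suggests, that serialization 
downcasts the record to its protobuf-backed implementation.

```java
// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
class CastSketch {
  // Plays the role of the abstract record type (like NodeId).
  static abstract class Id {
    abstract String getHost();
  }

  // Plays the role of the protobuf-backed implementation (like NodeIdPBImpl).
  static final class PbId extends Id {
    private final String host;
    PbId(String host) { this.host = host; }
    @Override String getHost() { return host; }
    String toWire() { return "id:" + host; }
  }

  // Plays the role of a hand-written subclass (like UnknownNodeId).
  static final class UnknownId extends Id {
    private final String host;
    UnknownId(String host) { this.host = host; }
    @Override String getHost() { return host; }
  }

  // Assumed behaviour: serialization downcasts to the protobuf-backed type.
  static String serialize(Id id) {
    return ((PbId) id).toWire();
  }

  // Returns false when the downcast fails with a ClassCastException.
  static boolean serializes(Id id) {
    try {
      serialize(id);
      return true;
    } catch (ClassCastException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(serializes(new PbId("host2")));      // true
    System.out.println(serializes(new UnknownId("host2"))); // false
  }
}
```

Only the ID that is already of the protobuf-backed type survives the cast, 
which is presumably why the dummy node needs an ID built the standard way 
rather than a custom subclass.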

For the unit test, I called {{allocate}} for this scenario, but that did not 
reproduce the issue until I explicitly set the updatedNodes to a report 
carrying an UnknownNodeId object.

Asking [~jlowe], [~templedf] for comments and corrections.

Excerpt from a sample test:
{code}
    AllocateRequest allocateRequest =
        Records.newRecord(AllocateRequest.class);
    AllocateResponse resp = rmClient.allocate(allocateRequest);
    // Plain allocate() did not reproduce the issue, so force a report
    // carrying an UnknownNodeId into updatedNodes.
    NodeReport report = new NodeReportPBImpl();
    report.setNodeId(new NodesListManager.UnknownNodeId("host2"));
    List<NodeReport> reports = new ArrayList<NodeReport>();
    reports.add(report);
    resp.setUpdatedNodes(reports);
    allocateRequest =
        Records.newRecord(AllocateRequest.class);
    // getProto() is the call that appears in the reported stack trace.
    YarnServiceProtos.AllocateResponseProto p =
        ((AllocateResponsePBImpl) resp).getProto();
{code}

Proposed change in NodesListManager.java:
{code}
  private void setDecomissionedNMs() {
    Set<String> excludeList = hostsReader.getExcludedHosts();
    for (final String host : excludeList) {
      NodeId nodeId = makeUnknownNodeId(host);
      RMNodeImpl rmNode = new RMNodeImpl(nodeId,
          rmContext, host, -1, -1, makeUnknownNode(host), null, null);
      rmContext.getInactiveRMNodes().putIfAbsent(
          rmNode.getNodeID().getHost(), rmNode);
      rmNode.handle(new RMNodeEvent(rmNode.getNodeID(),
          RMNodeEventType.DECOMMISSION));
    }
  }
{code}
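
The proposed change registers each excluded host with {{putIfAbsent}}. 
Assuming the inactive-node map is a ConcurrentMap keyed by host name, a host 
that is already tracked keeps its existing entry. A minimal sketch of that 
behaviour, using a plain ConcurrentHashMap and string placeholders (both 
hypothetical) in place of the RM context and RMNode objects:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration; the map and its values stand in for
// rmContext.getInactiveRMNodes() and RMNode instances.
class PutIfAbsentSketch {
  // Registers a placeholder for the host and returns whatever value was
  // already mapped, or null if this call inserted the entry.
  static String register(ConcurrentMap<String, String> inactive,
      String host, String node) {
    return inactive.putIfAbsent(host, node);
  }

  public static void main(String[] args) {
    ConcurrentMap<String, String> inactive =
        new ConcurrentHashMap<String, String>();
    System.out.println(register(inactive, "host2", "first"));  // null
    System.out.println(register(inactive, "host2", "second")); // first
    System.out.println(inactive.get("host2"));                 // first
  }
}
```

Note that in the proposed snippet the DECOMMISSION event is still sent to the 
newly built rmNode even when an entry for that host already existed.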

{code}
  Node makeUnknownNode(final String host) {
    return new Node() {
      @Override
      public String getNetworkLocation() {
        return null;
      }

      @Override
      public void setNetworkLocation(String location) {

      }

      @Override
      public String getName() {
        return host;
      }

      @Override
      public Node getParent() {
        return null;
      }

      @Override
      public void setParent(Node parent) {

      }

      @Override
      public int getLevel() {
        return 0;
      }

      @Override
      public void setLevel(int i) {

      }
    };
  }
{code}

> NodesListManager$UnknownNodeId ClassCastException
> -------------------------------------------------
>
>                 Key: YARN-4723
>                 URL: https://issues.apache.org/jira/browse/YARN-4723
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.3
>            Reporter: Jason Lowe
>            Assignee: Kuhu Shukla
>            Priority: Critical
>
> Saw the following in an RM log:
> {noformat}
> 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC Server handler 5 on 8030, call org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff
> java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247)
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271)
>         at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647)
>         at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175)
>         at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>         at org.apache.hadoop.ipc.Server.call(Server.java:2267)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217)
> {noformat}


