[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163294#comment-15163294 ]
Kuhu Shukla commented on YARN-4723:
-----------------------------------
The primary cause of this failure is the {{UnknownNodeId}} object. Even if we
put this dummy nodeId in inactiveRMNodes rather than in the active RMNodes, the
transition from NEW to DECOMMISSIONED marks the node unusable (NODE_UNUSABLE).
That still triggers a NODE_UPDATE for every app whose final state is not yet
stored, which in turn populates {{updatedNodes}} in the AllocateResponse.
{code}
@Override
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      if (!app.isAppFinalStateStored()) {
        this.rmContext
            .getDispatcher()
            .getEventHandler()
            .handle(
                new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                    RMAppNodeUpdateType.NODE_UNUSABLE));
      }
    }
{code}
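To make the fan-out in that branch easier to reason about, here is a small standalone model of it. Everything in it ({{FanOutSketch}}, {{App}}, {{onNodeUnusable}}) is hypothetical and is not RM code; in particular, it collapses the dispatcher and the RMAppNodeUpdateEvent into a direct list append, which is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the NODE_UNUSABLE fan-out; none of these are real RM classes.
class FanOutSketch {
    static final class App {
        final String id;
        final boolean finalStateStored;
        // Stands in for the per-app list that later feeds updatedNodes.
        final List<String> updatedNodes = new ArrayList<>();

        App(String id, boolean finalStateStored) {
            this.id = id;
            this.finalStateStored = finalStateStored;
        }
    }

    // Every app whose final state is not yet stored hears about the unusable node.
    static void onNodeUnusable(String node, List<App> apps) {
        for (App app : apps) {
            if (!app.finalStateStored) {
                app.updatedNodes.add(node);
            }
        }
    }

    public static void main(String[] args) {
        List<App> apps = new ArrayList<>();
        apps.add(new App("app_1", false)); // still running
        apps.add(new App("app_2", true));  // already finished
        onNodeUnusable("host2", apps);
        for (App app : apps) {
            System.out.println(app.id + " -> " + app.updatedNodes);
        }
    }
}
```

The point of the model is that the dummy node reaches running apps regardless of which RMNodes map it was stored in, because the notification is driven by the state transition rather than by the map.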
That said, we should not add the node to the active list. The way to solve
this problem, though, is to get rid of UnknownNodeId and use anonymous classes
to initialize these dummy nodes.
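As a rough illustration of why a custom id type trips the serialization path, here is a self-contained sketch. The types {{SketchNodeId}}, {{SketchNodeIdPBImpl}}, {{SketchUnknownNodeId}} and {{CastSketch}} are hypothetical stand-ins, not the YARN classes; they only mimic the unconditional downcast that the stack trace on this issue points at. Whether a given replacement avoids the cast depends on the concrete type it returns, which this sketch does not settle.

```java
// Hypothetical stand-ins for the id type and its PB-backed implementation.
abstract class SketchNodeId {
    abstract String getHost();
    abstract int getPort();

    // Stand-in for a factory that always hands back the PB-backed type.
    static SketchNodeId newInstance(String host, int port) {
        return new SketchNodeIdPBImpl(host, port);
    }
}

final class SketchNodeIdPBImpl extends SketchNodeId {
    private final String host;
    private final int port;

    SketchNodeIdPBImpl(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override String getHost() { return host; }
    @Override int getPort() { return port; }

    // Stand-in for producing the wire format.
    String toProto() { return host + ":" + port; }
}

// A custom subclass, analogous in spirit to UnknownNodeId.
final class SketchUnknownNodeId extends SketchNodeId {
    private final String host;

    SketchUnknownNodeId(String host) { this.host = host; }

    @Override String getHost() { return host; }
    @Override int getPort() { return -1; }
}

class CastSketch {
    // Mimics a serializer that assumes every id is the PB-backed type.
    static String convertToProtoFormat(SketchNodeId id) {
        return ((SketchNodeIdPBImpl) id).toProto();
    }

    // True if the id survives the downcast, false on ClassCastException.
    static boolean serializes(SketchNodeId id) {
        try {
            convertToProtoFormat(id);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new SketchUnknownNodeId("host2")));       // prints false
        System.out.println(serializes(SketchNodeId.newInstance("host2", -1))); // prints true
    }
}
```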
For the unit test, I did call {{allocate}} for this scenario, but that did not
reproduce the issue until I explicitly set updatedNodes to a list containing a
NodeReport with an UnknownNodeId.
Asking [~jlowe], [~templedf] for comments and corrections.
Excerpt from a sample test:
{code}
AllocateRequest allocateRequest =
    Records.newRecord(AllocateRequest.class);
AllocateResponse resp = rmClient.allocate(allocateRequest);
// Force an UnknownNodeId into updatedNodes; allocate() alone did not produce one.
NodeReport report = new NodeReportPBImpl();
report.setNodeId(new NodesListManager.UnknownNodeId("host2"));
List<NodeReport> reports = new ArrayList<NodeReport>();
reports.add(report);
resp.setUpdatedNodes(reports);
allocateRequest =
    Records.newRecord(AllocateRequest.class);
// getProto() is the call that appears in the reported stack trace.
YarnServiceProtos.AllocateResponseProto p =
    ((AllocateResponsePBImpl) resp).getProto();
{code}
Proposed change in NodesListManager.java:
{code}
private void setDecomissionedNMs() {
  Set<String> excludeList = hostsReader.getExcludedHosts();
  for (final String host : excludeList) {
    NodeId nodeId = makeUnknownNodeId(host);
    RMNodeImpl rmNode = new RMNodeImpl(nodeId,
        rmContext, host, -1, -1, makeUnknownNode(host), null, null);
    rmContext.getInactiveRMNodes().putIfAbsent(
        rmNode.getNodeID().getHost(), rmNode);
    rmNode.handle(new RMNodeEvent(rmNode.getNodeID(),
        RMNodeEventType.DECOMMISSION));
  }
}
{code}
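One detail of the proposed change worth spelling out is the {{putIfAbsent}} call: an entry already registered for a host is left in place and the new dummy node is dropped from the map. The sketch below shows that behaviour with {{java.util.concurrent.ConcurrentHashMap}}; the class {{InactiveNodesSketch}}, the helper {{registerIfAbsent}}, and the use of plain strings in place of RMNode objects are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the inactive-nodes map, keyed by host name.
class InactiveNodesSketch {
    // Returns true if the node was registered, false if the host already had an entry.
    static boolean registerIfAbsent(ConcurrentMap<String, String> inactive,
                                    String host, String node) {
        // putIfAbsent returns null only when no previous mapping existed.
        return inactive.putIfAbsent(host, node) == null;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> inactive = new ConcurrentHashMap<>();
        System.out.println(registerIfAbsent(inactive, "host2", "real-node"));  // prints true
        System.out.println(registerIfAbsent(inactive, "host2", "dummy-node")); // prints false
        System.out.println(inactive.get("host2"));                             // prints real-node
    }
}
```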
{code}
Node makeUnknownNode(final String host) {
  return new Node() {
    @Override
    public String getNetworkLocation() {
      return null;
    }

    @Override
    public void setNetworkLocation(String location) {
    }

    @Override
    public String getName() {
      return host;
    }

    @Override
    public Node getParent() {
      return null;
    }

    @Override
    public void setParent(Node parent) {
    }

    @Override
    public int getLevel() {
      return 0;
    }

    @Override
    public void setLevel(int i) {
    }
  };
}
{code}
> NodesListManager$UnknownNodeId ClassCastException
> -------------------------------------------------
>
> Key: YARN-4723
> URL: https://issues.apache.org/jira/browse/YARN-4723
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3
> Reporter: Jason Lowe
> Assignee: Kuhu Shukla
> Priority: Critical
>
> Saw the following in an RM log:
> {noformat}
> 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC Server handler 5 on 8030, call org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff
> java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl
>     at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247)
>     at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271)
>     at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647)
>     at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>     at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>     at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175)
>     at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>     at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server.call(Server.java:2267)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)