[
https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681783#comment-15681783
]
Varun Saxena commented on YARN-5918:
------------------------------------
While adding null checks fixes the NPE, is there something else that can or
should be done? If we fix the code as above, we will return fewer nodes for
scheduling opportunistic containers than the
yarn.opportunistic-container-allocation.nodes-used configuration specifies,
even though enough nodes are available. But the list should be refreshed the
very next second (per the default config), which may be fine.
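A minimal sketch of the null-check fix being discussed: skip entries whose RMNode has already been removed from the cluster instead of dereferencing null. The class and method names below are simplified stand-ins, not the actual YARN types in OpportunisticContainerAllocatorAMService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch: convertToRemoteNodes-style filtering, assuming a lookup
// that returns null for nodes no longer known to the RM.
class ConvertSketch {
    static List<String> convertToRemoteNodes(List<String> nodeIds,
                                             Map<String, String> rmNodes) {
        List<String> remoteNodes = new ArrayList<>();
        for (String id : nodeIds) {
            String rmNode = rmNodes.get(id);
            // The node may have been lost between the periodic sort and
            // this allocate call; returning fewer nodes than configured
            // is preferable to throwing an NPE on the allocate path.
            if (rmNode != null) {
                remoteNodes.add(rmNode);
            }
        }
        return remoteNodes;
    }
}
```

The trade-off is exactly the one noted above: the caller may receive fewer nodes than configured until the next sort cycle repopulates the list.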
Cluster nodes are sorted in NodeQueueLoadMonitor every 1 second by default and
stored in a list. Although we remove a node from the cluster nodes when it is
lost, we do not remove it from the sorted nodes, because doing so would require
iterating over the list. Can we keep a set instead? Also, when an allocate
request comes in and we fetch the least loaded nodes, we simply create a
sublist of the sorted nodes. We could instead iterate over the list and check
whether each node is still running to avoid the NPE, but this would be slower
than creating a sublist, especially when the number of nodes configured for
scheduling opportunistic containers is much larger than the default of 10.
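The alternative mentioned above (iterate and check liveness rather than take a plain sublist) could look roughly like this. This is a sketch with hypothetical names, not the NodeQueueLoadMonitor API; the set of running nodes stands in for whatever membership check the RM actually exposes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hedged sketch: take the k least loaded nodes from the sorted snapshot,
// skipping nodes that have since been removed from the cluster. This is
// O(n) in the worst case, versus O(1) for List.subList, which is the
// performance concern raised in the comment.
class LeastLoadedNodesSketch {
    static List<String> getLeastLoadedNodes(List<String> sortedNodes,
                                            Set<String> runningNodes,
                                            int k) {
        List<String> result = new ArrayList<>(k);
        for (String node : sortedNodes) {
            if (result.size() == k) {
                break;
            }
            // Stale entries linger in the sorted snapshot until the next
            // 1-second sort pass; filtering them here avoids the NPE.
            if (runningNodes.contains(node)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

With the default of 10 nodes and a fresh sort every second, the scan usually terminates early, so the overhead may matter only when the configured node count approaches the cluster size.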
I guess we can check with guys working on distributed scheduling before
deciding on a fix.
cc [~asuresh]
> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>
> Key: YARN-5918
> URL: https://issues.apache.org/jira/browse/YARN-5918
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Attachments: YARN-5918.0001.patch
>
>
> Allocate request failure during Opportunistic container allocation when
> nodemanager is lost
> {noformat}
> 2016-11-20 10:38:49,011 INFO
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS
> APPID=application_1479637990302_0002
> CONTAINERID=container_e12_1479637990302_0002_01_000006
> RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 7 on 8030, call Call#35 Retry#0
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from
> 172.17.0.2:51584
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
> at
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
> at
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
> at
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
> at
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
> container_e12_1479637990302_0002_01_000002 Container Transitioned from
> RUNNING to COMPLETED
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]