[jira] [Commented] (YARN-5918) Opportunistic scheduling allocate request failure when NM lost

Arun Suresh (JIRA) Mon, 21 Nov 2016 15:35:59 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685102#comment-15685102
 ]


Arun Suresh commented on YARN-5918:
-----------------------------------

Thanks for raising this [~bibinchundatt] and for chiming in [~varun_saxena].

bq. If we fix the code as above, we will return fewer nodes for scheduling 
opportunistic containers than the 
yarn.opportunistic-container-allocation.nodes-used configuration even though 
enough nodes are available. But this should be updated the very next second (as 
per default config), which may be fine.
As you pointed out, this is actually fine.
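
To make the trade-off concrete, here is a minimal sketch of the behaviour 
being discussed. It is not the attached patch: {{LeastLoadedNodesSketch}}, 
{{toRemoteNodes}}, {{liveNodes}} and the string node ids are hypothetical 
stand-ins for the real RM types. It assumes the fix simply skips entries of 
the sorted list that are no longer present in the cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; not the real OpportunisticContainerAllocatorAMService.
class LeastLoadedNodesSketch {

  // sortedNodeIds: least-loaded-first list, possibly stale.
  // liveNodes: node id -> HTTP address for nodes currently in the cluster
  // (assumed shape, for illustration only).
  static List<String> toRemoteNodes(List<String> sortedNodeIds,
      Map<String, String> liveNodes, int nodesUsed) {
    List<String> remoteNodes = new ArrayList<>();
    // Take the first nodesUsed entries of the (possibly stale) sorted list.
    int limit = Math.min(nodesUsed, sortedNodeIds.size());
    for (String nodeId : sortedNodeIds.subList(0, limit)) {
      String httpAddress = liveNodes.get(nodeId);
      if (httpAddress == null) {
        // The node was lost after the list was last sorted. Skip it instead
        // of dereferencing null, which is where an NPE would otherwise occur.
        continue;
      }
      remoteNodes.add(nodeId + "@" + httpAddress);
    }
    return remoteNodes;
  }

  public static void main(String[] args) {
    Map<String, String> liveNodes = new HashMap<>();
    liveNodes.put("n1", "host1:8042");
    liveNodes.put("n3", "host3:8042"); // n2 has been lost
    List<String> sorted = Arrays.asList("n1", "n2", "n3");
    System.out.println(toRemoteNodes(sorted, liveNodes, 2));
  }
}
```

With nodesUsed set to 2 the sketch returns only n1, even though n3 is alive. 
That is the temporary shortfall mentioned in the quote, and it goes away once 
the sorted list is recomputed.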

bq. Although we remove a node from cluster nodes when it is lost, we do not 
remove it from sorted nodes, because doing so would require iterating over 
the list. Can we keep a set instead?
We had initially thought of using a SortedSet, but insertions and deletions 
were somewhat expensive, and a LinkedList cheaply satisfied our use case.
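
A rough sketch of the list-based approach, under stated assumptions: 
{{SortedNodesSketch}} and its methods are hypothetical names, the load metric 
is a plain queue length, and the sorted list is assumed to be rebuilt 
periodically from the cluster-node map. Removing a lost node then only 
touches the map, and the stale list entry disappears at the next recompute.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the real NodeQueueLoadMonitor.
class SortedNodesSketch {
  // node id -> queue length (assumed load metric)
  private final Map<String, Integer> clusterNodes = new HashMap<>();
  private List<String> sortedNodes = new LinkedList<>();

  void updateNode(String nodeId, int queueLength) {
    clusterNodes.put(nodeId, queueLength);
  }

  // Constant-time removal from the map; the sorted list is left untouched,
  // so it can hold a stale entry until the next recompute.
  void removeNode(String nodeId) {
    clusterNodes.remove(nodeId);
  }

  // Assumed to be called periodically: rebuild the list from the map,
  // least-loaded first, with the node id as a tie-breaker.
  void recomputeSortedNodes() {
    List<String> ids = new ArrayList<>(clusterNodes.keySet());
    ids.sort(Comparator.comparing((String id) -> clusterNodes.get(id))
        .thenComparing(Comparator.<String>naturalOrder()));
    sortedNodes = new LinkedList<>(ids);
  }

  List<String> getSortedNodes() {
    return new ArrayList<>(sortedNodes);
  }

  public static void main(String[] args) {
    SortedNodesSketch monitor = new SortedNodesSketch();
    monitor.updateNode("n1", 3);
    monitor.updateNode("n2", 1);
    monitor.recomputeSortedNodes();
    monitor.removeNode("n2");
    System.out.println("before recompute: " + monitor.getSortedNodes());
    monitor.recomputeSortedNodes();
    System.out.println("after recompute: " + monitor.getSortedNodes());
  }
}
```

Between the removal and the next recompute, readers of the list have to 
tolerate entries for nodes that are gone, which is why the consumer needs a 
null check.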

Could you add a test to {{TestNodeQueueLoadMonitor}} for this?
+1 pending that.

> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>
>                 Key: YARN-5918
>                 URL: https://issues.apache.org/jira/browse/YARN-5918
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5918.0001.patch
>
>
> Allocate request failure during Opportunistic container allocation when 
> nodemanager is lost 
> {noformat}
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1479637990302_0002 CONTAINERID=container_e12_1479637990302_0002_01_000006 RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8030, call Call#35 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 172.17.0.2:51584
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
>         at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e12_1479637990302_0002_01_000002 Container Transitioned from RUNNING to COMPLETED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

