[ https://issues.apache.org/jira/browse/YARN-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang updated YARN-7631:
-------------------------------
    Description: 
Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> 
Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> 
ResourceRequestInfo (the actual RR). This means that only RRs with the same 
(requestId, priority, resourcename, executionType, resource) will be grouped 
and aggregated together. 

On the RM side, however, the mapping is SchedulerRequestKey (RequestId, 
priority) -> LocalityAppPlacementAllocator (ResourceName -> RR). 

The issue is that on the RM side, Resource is not part of the key to the RR at 
all. (Note that executionType is not part of the RM-side key either, but that 
is fine because the RM handles it separately as container update requests.) 
This means that for the same (requestId, priority, resourcename), RRs with 
different Resource values are grouped together and overwrite each other in the 
RM. As a result, some of the container requests are lost and will never be 
allocated. Furthermore, since the two RRs are kept under different keys on the 
AMRMClient side, allocation of RR1 only triggers a cancel for RR1, so the 
pending RR2 is not resent either. 
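
The key mismatch can be sketched with two plain Java maps. This is a 
simplified, hypothetical model and not the actual Hadoop code: the class 
RequestKeySketch, its Request type, the addContainerRequest helper, and the use 
of a single memoryMb value to stand in for Resource are all assumptions made 
for illustration only. 

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the two bookkeeping schemes; not Hadoop code.
class RequestKeySketch {

    /** A pending request: the capacity asked for and the number of containers. */
    static final class Request {
        final int memoryMb;
        final int numContainers;

        Request(int memoryMb, int numContainers) {
            this.memoryMb = memoryMb;
            this.numContainers = numContainers;
        }
    }

    // Client-side table, keyed by
    // (requestId, priority, resourceName, executionType, memoryMb).
    final Map<List<Object>, Request> client = new HashMap<>();

    // RM-side table, keyed by (requestId, priority, resourceName) only.
    final Map<List<Object>, Request> rm = new HashMap<>();

    void addContainerRequest(long requestId, int priority, String resourceName,
                             String executionType, int memoryMb) {
        List<Object> clientKey = Arrays.asList(
                requestId, priority, resourceName, executionType, memoryMb);
        Request previous = client.get(clientKey);
        int count = (previous == null) ? 1 : previous.numContainers + 1;
        Request updated = new Request(memoryMb, count);
        client.put(clientKey, updated);

        // The updated RR is sent to the RM, whose key omits the capacity,
        // so an RR with a different capacity replaces the earlier one.
        rm.put(Arrays.asList(requestId, priority, resourceName), updated);
    }

    public static void main(String[] args) {
        RequestKeySketch sketch = new RequestKeySketch();
        sketch.addContainerRequest(0L, 1, "*", "GUARANTEED", 1024); // RR1
        sketch.addContainerRequest(0L, 1, "*", "GUARANTEED", 2048); // RR2

        System.out.println("client entries: " + sketch.client.size()); // 2
        System.out.println("RM entries: " + sketch.rm.size());         // 1
        Request kept = sketch.rm.get(Arrays.asList(0L, 1, "*"));
        System.out.println("RM keeps memoryMb: " + kept.memoryMb);     // 2048
    }
}
```

In this model the client still tracks both RR1 and RR2, while the RM-side map 
holds a single entry for the shared key, which mirrors the lost request 
described above. 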

I’ve attached a unit test (resourcebug.patch) that fails on trunk to 
illustrate this issue. 


> ResourceRequest with different Capacity (Resource) overrides each other in RM
> -----------------------------------------------------------------------------
>
>                 Key: YARN-7631
>                 URL: https://issues.apache.org/jira/browse/YARN-7631
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>         Attachments: resourcebug.patch
>
>
> Today in AMRMClientImpl, the ResourceRequests (RR) are kept as: RequestId -> 
> Priority -> ResourceName -> ExecutionType -> Resource (Capacity) -> 
> ResourceRequestInfo (the actual RR). This means that only RRs with the same 
> (requestId, priority, resourcename, executionType, resource) will be grouped 
> and aggregated together. 
> On the RM side, however, the mapping is SchedulerRequestKey (RequestId, 
> priority) -> LocalityAppPlacementAllocator (ResourceName -> RR). 
> The issue is that on the RM side, Resource is not part of the key to the RR 
> at all. (Note that executionType is not part of the RM-side key either, but 
> that is fine because the RM handles it separately as container update 
> requests.) This means that for the same (requestId, priority, resourcename), 
> RRs with different Resource values are grouped together and overwrite each 
> other in the RM. As a result, some of the container requests are lost and 
> will never be allocated. 
> Furthermore, since the two RRs are kept under different keys on the 
> AMRMClient side, allocation of RR1 only triggers a cancel for RR1, so the 
> pending RR2 is not resent either. 
> I’ve attached a unit test (resourcebug.patch) that fails on trunk to 
> illustrate this issue. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
