Bikas Saha commented on YARN-1902:

The AMRMClient was not written to automatically remove requests because it does 
not know which requests will be matched to allocated containers. The explicit 
contract is for users of AMRMClient to remove requests that have been matched 
to containers.
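
To make the contract concrete, here is a minimal sketch using a toy table rather than the real AMRMClient classes. ToyRequestTable and its methods are hypothetical names invented for illustration; the only behavior assumed from the discussion is that the client sends the total outstanding count for a capability on each allocate call. In the real client the removal step corresponds to calling removeContainerRequest.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the client's request table; NOT the real AMRMClient. */
final class ToyRequestTable {
    // Outstanding request count per capability (hypothetical string key).
    private final Map<String, Integer> outstanding = new HashMap<>();

    /** Records one more outstanding request for the capability. */
    void addRequest(String capability) {
        outstanding.merge(capability, 1, Integer::sum);
    }

    /** Called by the user once a request has been matched to a container. */
    void removeRequest(String capability) {
        Integer n = outstanding.get(capability);
        if (n == null) {
            throw new IllegalStateException("no outstanding request for " + capability);
        }
        if (n == 1) {
            outstanding.remove(capability);
        } else {
            outstanding.put(capability, n - 1);
        }
    }

    /** Count that the next allocate call would send for the capability. */
    int countToSend(String capability) {
        return outstanding.getOrDefault(capability, 0);
    }

    public static void main(String[] args) {
        ToyRequestTable table = new ToyRequestTable();
        String cap = "1024MB,1vcore";

        // Three requests are added and (in this toy scenario) all satisfied.
        for (int i = 0; i < 3; i++) {
            table.addRequest(cap);
        }

        // The user adds one more request without removing the matched ones.
        table.addRequest(cap);
        System.out.println("without removal: " + table.countToSend(cap)); // 4

        // The user honors the contract and removes the three matched requests.
        for (int i = 0; i < 3; i++) {
            table.removeRequest(cap);
        }
        System.out.println("after removal: " + table.countToSend(cap)); // 1
    }
}
```

With three matched requests left in the table, the next call asks for four containers where one is wanted; after the user removes them, the count drops to one.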

If we change that behavior to automatically remove requests, then two entities 
(the user and AMRMClient) may both end up removing the same requests. So that 
change should only be made in a different version of AMRMClient, or else 
existing users will break.

In the worst case, if the AMRMClient (automatically) removes the wrong request, 
then the application will hang because the RM will not provide the container 
that it needs. Not automatically removing the request has the downside that the 
application may get additional containers, which it then needs to release. We 
chose excess containers over hanging for the original implementation.

Excess containers should happen rarely because the user controls when 
AMRMClient heartbeats to the RM and can do so after having removed all 
matched requests, so that the remote request table reflects the current state 
of outstanding requests. There may still be a race condition on the RM side 
that gives more containers. Excess containers can happen more often with 
AMRMClientAsync, because it heartbeats on a regular schedule and can send more 
requests than are really outstanding if the heartbeat goes out before the user 
has removed the matched requests.
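
The effect of heartbeat timing can be sketched with a small replay of events. This is a toy model, not YARN code: HeartbeatTimingSketch, Step, and excessAtFirstHeartbeat are hypothetical names, and it assumes a simplified RM that grants exactly the count it receives in a heartbeat.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of heartbeat timing; none of these names exist in YARN. */
final class HeartbeatTimingSketch {
    enum Step { ADD_REQUEST, REMOVE_MATCHED, HEARTBEAT }

    /**
     * Starts with `matched` requests that are already satisfied but not yet
     * removed by the user, replays the steps in order, and returns how many
     * more containers the first heartbeat asks for than are really wanted.
     * Returns 0 if no heartbeat occurs.
     */
    static int excessAtFirstHeartbeat(int matched, List<Step> steps) {
        int unremoved = matched;   // matched requests still in the table
        int table = matched;       // client-side outstanding count
        int genuinelyNew = 0;      // requests that are really outstanding
        for (Step s : steps) {
            switch (s) {
                case ADD_REQUEST:
                    table++;
                    genuinelyNew++;
                    break;
                case REMOVE_MATCHED:
                    table -= unremoved;
                    unremoved = 0;
                    break;
                case HEARTBEAT:
                    // Assumed toy RM: grants whatever count it is sent.
                    return table - genuinelyNew;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // User-driven heartbeat: remove matched requests first, then heartbeat.
        int userDriven = excessAtFirstHeartbeat(3,
                Arrays.asList(Step.REMOVE_MATCHED, Step.ADD_REQUEST, Step.HEARTBEAT));
        // Scheduled heartbeat fires before the user has removed anything.
        int scheduled = excessAtFirstHeartbeat(3,
                Arrays.asList(Step.ADD_REQUEST, Step.HEARTBEAT, Step.REMOVE_MATCHED));
        System.out.println("user-driven excess: " + userDriven); // 0
        System.out.println("scheduled excess: " + scheduled);    // 3
    }
}
```

In this model the only difference between the two runs is whether the heartbeat goes out before or after the matched requests are removed.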

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-1902
>                 URL: https://issues.apache.org/jira/browse/YARN-1902
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Sietse T. Au
>            Assignee: Sietse T. Au
>              Labels: client
>         Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) 
> containers are indeed requested in both scenarios, but that the correct 
> behavior is observed only in the second scenario.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map<Resource, 
> ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.

This message was sent by Atlassian JIRA
