[
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546377#comment-14546377
]
Bikas Saha commented on YARN-1902:
----------------------------------
The AMRMClient was not written to automatically remove requests because it does
not know which requests will be matched to allocated containers. The explicit
contract is for users of AMRMClient to remove requests that have been matched
to containers.
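For illustration, a minimal sketch of that contract, assuming a simple FIFO
matching between added requests and allocated containers (the PendingRequests
bookkeeping below is an application-side assumption, not part of the YARN API):
{code:java}
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

/** Sketch: the application, not the client library, removes matched requests. */
class PendingRequests {
  private final AMRMClient<ContainerRequest> amRMClient;
  // Requests added via addContainerRequest but not yet matched to a container.
  private final Queue<ContainerRequest> pending = new ConcurrentLinkedQueue<>();

  PendingRequests(AMRMClient<ContainerRequest> amRMClient) {
    this.amRMClient = amRMClient;
  }

  void request(ContainerRequest req) {
    pending.add(req);
    amRMClient.addContainerRequest(req);
  }

  // Call with the containers returned by allocate() or passed to the
  // AMRMClientAsync onContainersAllocated callback.
  void matched(List<Container> allocated) {
    for (Container container : allocated) {
      ContainerRequest req = pending.poll();
      if (req != null) {
        // Per the contract, remove the request that this container satisfies
        // so it is not asked for again.
        amRMClient.removeContainerRequest(req);
      }
      // ... launch work on the container ...
    }
  }
}
{code}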
If we change that behavior to automatically remove requests, it may lead to
issues where two entities are removing requests: 1) the user and 2) the
AMRMClient. So that change should only be made in a different version of
AMRMClient, or else existing users will break.
In the worst case, if the AMRMClient (automatically) removes the wrong
request, then the application will hang because the RM will never provide the
container that is actually needed. Not automatically removing the request has
the downside of getting additional containers that need to be released by the
application. We chose excess containers over hanging for the original
implementation.
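For completeness, a minimal sketch of releasing such excess containers; the
stillNeeded count is an assumed piece of application bookkeeping:
{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class ExcessContainerHandler {
  private final AMRMClient<ContainerRequest> amRMClient;

  ExcessContainerHandler(AMRMClient<ContainerRequest> amRMClient) {
    this.amRMClient = amRMClient;
  }

  /** Use up to stillNeeded containers and release the rest back to the RM. */
  void handleAllocated(List<Container> allocated, int stillNeeded) {
    int used = 0;
    for (Container container : allocated) {
      if (used < stillNeeded) {
        used++;
        // ... launch work on this container ...
      } else {
        // Excess allocation: return it instead of holding it idle.
        amRMClient.releaseAssignedContainer(container.getId());
      }
    }
  }
}
{code}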
Excess containers should happen rarely because the user controls when the
AMRMClient heartbeats to the RM and can do that after having removed all
matched requests, so that the remote request table reflects the current state
of outstanding requests. There may still be a race condition on the RM side
that gives out more containers. Excess containers can happen more often with
AMRMClientAsync, because it heartbeats on a regular schedule and can send more
requests than are really outstanding if the heartbeat goes out before the user
has removed the matched requests.
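To make that ordering concrete, here is a rough sketch of a synchronous
heartbeat loop that removes matched requests before the next allocate() call
goes out; the outstanding queue and FIFO matching are assumptions of this
example:
{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.exceptions.YarnException;

class HeartbeatLoop {
  private final AMRMClient<ContainerRequest> amRMClient;
  private final Deque<ContainerRequest> outstanding = new ArrayDeque<>();

  HeartbeatLoop(AMRMClient<ContainerRequest> amRMClient) {
    this.amRMClient = amRMClient;
  }

  void heartbeatOnce(float progress) throws YarnException, IOException {
    // By the time this heartbeat goes out, all requests matched in the
    // previous iteration have already been removed, so the remote request
    // table reflects only what is still outstanding.
    AllocateResponse response = amRMClient.allocate(progress);

    for (Container container : response.getAllocatedContainers()) {
      ContainerRequest matched = outstanding.poll();
      if (matched != null) {
        // Remove the matched request now, before the next heartbeat.
        amRMClient.removeContainerRequest(matched);
      }
      // ... launch work on the container ...
    }
  }
}
{code}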
> Allocation of too many containers when a second request is done with the same
> resource capability
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 2.2.0, 2.3.0, 2.4.0
> Reporter: Sietse T. Au
> Assignee: Sietse T. Au
> Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y: addContainerRequest is called z
> times with x, allocate is called, and at least one of the z allocated
> containers is started. If another addContainerRequest call is then made,
> followed by an allocate call to the RM, (z+1) containers will be allocated
> where only 1 container is expected.
> Scenario 2:
> Same as Scenario 1, except that no containers are started between the
> allocate calls.
> Analyzing debug logs of the AMRMClientImpl, I have found that (z+1)
> containers are indeed requested in both scenarios, but that the correct
> behavior is observed only in the second scenario.
> Looking at the implementation I have found that this (z+1) request is caused
> by the structure of the remoteRequestsTable. The consequence of Map<Resource,
> ResourceRequestInfo> is that ResourceRequestInfo does not hold any
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers
> received.
> The solution implemented is to initialize a new ResourceRequest in
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test covering Scenario 1.
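For reference, a rough sketch of how Scenario 1 above might be reproduced
against AMRMClientImpl; the resource size, priority, and z value are arbitrary
assumptions for illustration:
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class ScenarioOne {
  static void reproduce(AMRMClient<ContainerRequest> amRMClient, int z)
      throws Exception {
    // ContainerRequest x with Resource y (sizes are arbitrary in this sketch).
    Resource y = Resource.newInstance(1024, 1);
    ContainerRequest x =
        new ContainerRequest(y, null, null, Priority.newInstance(0));

    for (int i = 0; i < z; i++) {
      amRMClient.addContainerRequest(x);
    }
    amRMClient.allocate(0f);        // eventually z containers are allocated
    // ... start at least one of the z containers ...

    amRMClient.addContainerRequest(x);
    amRMClient.allocate(0f);        // observed: z+1 more requested, 1 expected
  }
}
{code}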