[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546391#comment-14546391 ]
Vinod Kumar Vavilapalli commented on YARN-1902:
-----------------------------------------------
This was discussed multiple times before.
Two kinds of races can happen. The resource-table deduction can take place while
# allocated containers are already sitting in the RM (tracked at YARN-110)
# allocated containers are already sitting in the client library
Seems like this JIRA is talking about both (1) and (2).
The dist-shell example above sounds like it could be because of (1).
Re (2), as Bikas says, the notion of forcing apps to deduct requests after a
successful allocation (using AMRMClient.removeContainerRequest()) was
introduced because the library clearly doesn't have an idea of which
ResourceRequest to deduct from. [~leftnoteasy] mentioned offline that we could
at least deduct the count against the overall number (ANY request) for a given
priority. /cc [~bikassaha]
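
To make the ANY-count idea concrete, here is a rough, self-contained sketch. It is a toy model, not the AMRMClientImpl data structure: the class AnyCountDeduction, its methods, and the host/rack names are all made up for illustration. The only assumption borrowed from YARN is that "*" stands for the ANY request. The library cannot tell which host- or rack-level request an allocated container satisfied, so the sketch only decrements the ANY count for that priority.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, not the AMRMClientImpl implementation.
public class AnyCountDeduction {
    public static final String ANY = "*";

    // priority -> (resource name -> outstanding container count)
    private final Map<Integer, Map<String, Integer>> table = new HashMap<>();

    // Record one request at the given priority for each named location plus ANY.
    public void addRequest(int priority, String... locations) {
        Map<String, Integer> byName =
            table.computeIfAbsent(priority, p -> new HashMap<>());
        for (String location : locations) {
            byName.merge(location, 1, Integer::sum);
        }
        byName.merge(ANY, 1, Integer::sum);
    }

    // On allocation, only the ANY count is decremented (never below zero);
    // host and rack counts are left alone because the right one is unknown.
    public void onContainerAllocated(int priority) {
        Map<String, Integer> byName = table.get(priority);
        if (byName != null) {
            byName.computeIfPresent(ANY, (name, count) -> Math.max(0, count - 1));
        }
    }

    public int outstanding(int priority, String name) {
        return table.getOrDefault(priority, Collections.emptyMap())
                    .getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        AnyCountDeduction t = new AnyCountDeduction();
        t.addRequest(1, "host1", "rack1");
        t.addRequest(1, "host2", "rack1");
        t.onContainerAllocated(1);
        System.out.println(t.outstanding(1, ANY));     // prints 1
        System.out.println(t.outstanding(1, "rack1")); // prints 2
    }
}
```

The host and rack counts stay stale in this model, so the app would still need removeContainerRequest() to clean those up; the sketch only covers the overall count.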
> Allocation of too many containers when a second request is done with the same
> resource capability
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 2.2.0, 2.3.0, 2.4.0
> Reporter: Sietse T. Au
> Assignee: Sietse T. Au
> Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is
> called z times with x, allocate is called and at least one of the z allocated
> containers is started, then if another addContainerRequest call is done and
> subsequently an allocate call to the RM, (z+1) containers will be allocated,
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls.
> Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers
> are indeed requested in both scenarios, but that the correct behavior is
> observed only in the second scenario.
> Looking at the implementation I have found that this (z+1) request is caused
> by the structure of the remoteRequestsTable. The consequence of Map<Resource,
> ResourceRequestInfo> is that ResourceRequestInfo does not hold any
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers
> received.
> The solution implemented is to initialize a new ResourceRequest in
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.
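
For readers following the description above, scenario 1 can be approximated with a small toy model. Everything in it is hypothetical (ToyClient, ToyRm, run, and the capability key "y" as a plain string), and it rests on a simplifying assumption: the toy RM treats the count in each ask as the full number of containers still wanted and grants it immediately. It does not model the difference between scenarios 1 and 2, and it is not the attached patch. It only shows how a count that is never deducted turns a request for one more container into an ask for (z+1), and how calling a removeContainerRequest-style method after allocation avoids that.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, not AMRMClientImpl or the RM scheduler.
public class OverAllocationSketch {

    // Toy client table: capability -> container count sent in the next ask.
    static class ToyClient {
        final Map<String, Integer> remoteRequestsTable = new HashMap<>();

        void addContainerRequest(String capability) {
            remoteRequestsTable.merge(capability, 1, Integer::sum);
        }

        void removeContainerRequest(String capability) {
            remoteRequestsTable.computeIfPresent(
                capability, (key, count) -> Math.max(0, count - 1));
        }

        int ask(String capability) {
            return remoteRequestsTable.getOrDefault(capability, 0);
        }
    }

    // Toy RM (assumption): every ask is granted in full and at once.
    static class ToyRm {
        int allocate(int askedCount) {
            return askedCount;
        }
    }

    // Returns how many containers the second allocate call yields.
    public static int run(int z, boolean removeAfterAllocation) {
        ToyClient client = new ToyClient();
        ToyRm rm = new ToyRm();
        for (int i = 0; i < z; i++) {
            client.addContainerRequest("y");
        }
        int firstBatch = rm.allocate(client.ask("y"));
        if (removeAfterAllocation) {
            // The app deducts one request per container it received.
            for (int i = 0; i < firstBatch; i++) {
                client.removeContainerRequest("y");
            }
        }
        client.addContainerRequest("y"); // the app wants one more container
        return rm.allocate(client.ask("y"));
    }

    public static void main(String[] args) {
        System.out.println(run(3, false)); // prints 4, i.e. z+1
        System.out.println(run(3, true));  // prints 1
    }
}
```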
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)