[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2019-08-13 Thread Rick Moritz (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906150#comment-16906150
 ] 

Rick Moritz commented on YARN-1902:
---

Oh, and Hadoop 2.7 can also be added to the affected version list.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>Priority: Major
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2019-08-13 Thread Rick Moritz (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906148#comment-16906148
 ] 

Rick Moritz commented on YARN-1902:
---

This bug actually also can cause application crashes, if the application 
handles "ContainerAllocated"-events by stockpiling them, and then scheduling 
tasks to these containers as they arrive. This usually leads to timeouts of the 
involved token, and very interesting guesswork, why program logic is attempting 
to launch containers that have been assigned obsolete tokens.

I also wonder how this mixes with the recent addition of "opportunistic 
allocation".

Hadoop 3 would have been a great opportunity to close this bug :(

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>Priority: Major
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2016-05-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291103#comment-15291103
 ] 

Jason Lowe commented on YARN-1902:
--

Sorry for jumping in late, but I'd like to keep moving this forward.  There are 
a significant number of wasted container allocations as apps adjust their 
resource requests, and that adds unnecessary load to the RM and reduces cluster 
efficiency.

IMHO without a protocol overhaul the fix has to come from the RM side if we 
want to minimize the excess containers.  The inherent problem is that the AM is 
adjusting its resource request _without_ the knowledge of what the RM has 
already allocated since the last heartbeat.  Therefore if the RM sees an update 
to the AM ask during a heartbeat and that same heartbeat's response has 
containers already allocated it needs to adjust the AM's ask by the containers 
that are in that response.  For example:
# AM asks for 5 containers
# On a subsequent heartbeat the RM responds with 1 container
# On the next heartbeat the AM adjusts its ask to 4 containers, but the RM has 
already allocated the remaining 4 containers from the original ask.
# The RM needs to interpret the new ask not as 4 more containers but as 0 
containers since 4 of them are already satisfied in the current heartbeat's 
response.

If apps were well behaved, I think we could get most of the benefit by simply 
adjusting the new total (ANY) ask update by the number of containers in the 
same heartbeat's response.  It's true that an AM could get containers in that 
response that don't match its request, but a well-behaved app should realize 
that any container received counts against the total (ANY) resource request.  
Therefore if the app throws away the container but still needs another it must 
update at least the total container request to ask for a replacement.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-20 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552375#comment-14552375
 ] 

MENG DING commented on YARN-1902:
-

I have been experimenting with the idea of changing AppSchedulingInfo to 
maintain a total request table, a fulfilled allocation table, and then 
calculate the difference of the two tables as the real outstanding request 
table used for scheduling. All is fine until I realized that this cannot handle 
one use case where a AMRMClient, right before sending the allocation heartbeat, 
removes all container requests, and add new container requests at the same 
priority and location (possibly with different resource capability).  
AppSchedulingInfo does not know about this, and may not treat the newly added 
container requests as outstanding requests.

I agree that currently I do not see a clean solution without affecting backward 
compatibility. 

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-19 Thread Sietse T. Au (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550462#comment-14550462
 ] 

Sietse T. Au commented on YARN-1902:


All solutions will still be workarounds unless the protocol is revised. 

Another workaround would be to keep track of the requests by counting the 
number of requested containers and not sending new container requests to RM 
until the previous batch has been satisfied.

Consider the following scenario in the following order:
1. addContainerRequest is called n times and at each call the 
expectedContainers counter is incremented, the container request is added to a 
list of currentContainerRequests. 
2. allocate is called, a boolean waitingForResponse is set to true when 
ask.size > 0 which indicates container requests have been made.
3. addContainerRequest is called m times, since waitingForResponse is true, the 
request will be added to a list of queuedContainerRequests, the asks will be 
added to asksQueue and not asks. 
4. allocate is called, n - 1 containers are returned, expectedContainers will 
be decremented by n - 1.
5. allocate is called again, 1 container is returned, expectedContainers will 
be  0, 
waitingForResponse is set to false, 
for each currentContainerRequest removeContainerRequest,
currentContainerRequests = queuedContainerRequests, 
asks = asksQueue, 
expectedContainers = queuedContainerRequests.size
6. allocate is called and (3) will be submitted. 

Here, the satisfied container requests will be correctly removed from the table 
without user intervention and seems to apply to common use cases, excess 
containers now will only happen when containerRequest is removed after an 
allocate. But since there is no guarantee that it will be removed in time at 
the RM, it doesn't seem to be very significant. 

One problem here is that the expectedContainers will be invalid when you do the 
following: 
blacklist all the possible nodes, add container request, allocate,  remove 
blacklist, add container request, allocate.
This would make the client wait forever for a response of the first request as 
it will never be satisfied.

I'm not sure what else can be done by users apart from extending the 
AMRMClientImpl to fit their use case.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-18 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548492#comment-14548492
 ] 

Bikas Saha commented on YARN-1902:
--

An alternate approach that we tried in Apache Tez is to wrap a TaskScheduler 
around the AMRMClient that would take request from the application and do the 
matching internally. Since it would know the matching, it could automatically 
remove the matched requests also. (Still does not remove the race condition but 
it cleaner wrt to the user as an API). The TaskScheduler was written to be 
independent of Tez code so that we could contribute it to YARN as a library, 
however we did not find time to do so. Now that code has evolved quite a bit 
but the original, well-tested code could still be extracted from Tez 0.1 branch 
and contributed to YARN if someone is interested in doing that work.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-18 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548480#comment-14548480
 ] 

MENG DING commented on YARN-1902:
-

Thanks [~bikassaha] and [~vinodkv] for the education and background info. 
Really helpful. I can now appreciate that there is not a straightforward 
solution to this problem.

Originally I was coming from a pure user experience point of view, where I was 
thinking that if I ever want to use removeContainerRequest, it should only be 
because that I need to cancel previous add requests. Yes I may still get the 
number of containers from the previous requests, but that is understandable. 
However, I would have never thought that I still need to do 
removeContainerRequest to remove requests of matched containers in order to 
make the internal bookkeeping of AMRMClient correct. Why should a user worry 
about these things?

After reading the comments, I start to think that even if we were able to 
figure out which ResourceRequest to deduct from and automatically deduct it at 
the Client, it still won't solve race condition 1 (i.e., allocated containers 
are sitting in RM).

So rather than changing the client, can we not do something at the RM side? For 
example, in AppSchedulingInfo:
1. Maintain a table for total request *only*. The updateResourceRequests() call 
will update this table to reflect the total requests from the client (matching 
the client side remoteRequestsTable).
2. Maintain a table for requests that have been satisfied. Every time a 
successful allocation is made for this application, this table is updated.
3. The difference between table 1 and table 2 will be the outstanding resource 
requests. This table is updated at every updateResourceRequests() and every 
successful allocation. Of course proper synchronization needs to be taken care 
of.
4. The scheduling will be made based on the table 3 (i.e., the outstanding 
request table). 

Do you think if this is something worth considering?

Thanks a lot in advance.
Meng

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546421#comment-14546421
 ] 

Bikas Saha commented on YARN-1902:
--

Yes. And then the RM may give a container on H1 which is not useful for the 
app. If we again auto-decrement and release the container then we end up with 2 
outstanding requests and the job will hang because it needs 3 containers.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-15 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546418#comment-14546418
 ] 

Vinod Kumar Vavilapalli commented on YARN-1902:
---

bq. Wangda Tan mentioned offline that we could at-least deduct the count 
against the over-all number (ANY request) for a given priority.
Further thought tells me this is not desired in some cases as well.

Take the following example.

User originally wants: 1 container on H1, 1 container on H2, and 2 containers 
on R1 (rack). The request table becomes
|H1|1|
|H2|1|
|R1|2|
|*|4|

Now assuming RM returns a container on R2 (rack), auto-decrementing the request 
table will make it
|H1|1|
|H2|1|
|R1|2|
|*|3|

But user may actually want something like the following. This depends on what 
the user preferences are w.r.t scheduling.
|H1|0|
|H2|1|
|R1|2|
|*|3|

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-15 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546391#comment-14546391
 ] 

Vinod Kumar Vavilapalli commented on YARN-1902:
---

This was discussed multiple times before.

Two kinds of races can happen. A resource-table deduction happens when
 # allocated containers are already sitting in the RM (tracked at YARN-110)
 # allocated containers are already sitting in the client library

Seems like this JIRA is talking about both (1) and (2).

The dist-shell example above sounds like it could be because of (1).

Re (2), as Bikas says, the notion of forcing apps to deduct requests after a 
successful allocation (using AMRMClient.removeContainerRequest()) was 
introduced because the library clearly doesn't have an idea of which 
ResourceRequest to deduct from. [~leftnoteasy] mentioned offline that we could 
at-least deduct the count against the over-all number (ANY request) for a given 
priority. /cc [~bikassaha]

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546377#comment-14546377
 ] 

Bikas Saha commented on YARN-1902:
--

The AMRMClient was not written to automatically remove requests because it does 
not know which requests will be matched to allocated containers. The explicit 
contract is for users of AMRMClient to remove requests that have been matched 
to containers.

If we change that behavior to automatically remove requests then it may lead to 
issues where 2 entities are removing requests. 1) user 2) AMRMClient. So that 
change should only be made in a different version of AMRMClient or else 
existing users will break.

In the worst case, if the AMRMClient (automatically) removes the wrong request 
then the application will hang because the RM will not provide it the container 
that is needed. Not automatically removing the request has the downside of 
getting additional containers that need to be released by the application. We 
chose excess containers over hanging for the original implementation. 

Excess containers should happen rarely because the user controls when 
AMRMClient heartbeats to the RM and can do that after having removed all 
matched requests, so that the remote request table reflects the current state 
of outstanding requests. There may still be a race condition on the RM side 
that gives more containers. Excess containers can happen more often with 
AMRMClientAsync, because it heartbeats at a regular schedule and can send more 
requests than really outstanding if the heartbeat goes out before the user has 
removed the matched requests.


> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-15 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546209#comment-14546209
 ] 

MENG DING commented on YARN-1902:
-

I was almost going to log the same issue when I saw this thread (and also 
YARN-3020) :-).

After reading all the discussions, and after reading the related code, I still 
believe this is a bug.

I understand what [~bikassaha] has said that the AM-RM protocol is NOT a delta 
protocol, and that currently user (i.e., ApplicationMaster) is responsible for 
calling removeContainerRequest() after receiving an allocation, but consider 
the follow simple modification to the packaged *distributedshell* application:

{code:title=ApplicationMaster.java|borderStyle=solid}
--- 
a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
+++ 
b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
@@ -805,6 +805,8 @@ public void onContainersAllocated(List 
allocatedContainers) {
 // as all containers may not be allocated at one go.
 launchThreads.add(launchThread);
 launchThread.start();
+ContainerRequest containerAsk = setupContainerAskForRM();
+amRMClient.removeContainerRequest(containerAsk);
   }
 }
{code}

The code simply removes a container request after successfully receiving an 
allocated container in the ApplicationMaster. When you submit this application 
by specifying, say, 3 containers on the CLI, you will sometimes get 4 
containers allocated (not counting the AM container)! 

{code}
root@node2:~# hadoop 
org.apache.hadoop.yarn.applications.distributedshell.Client -jar 
/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0-SNAPSHOT.jar
 -shell_command "sleep 10" -num_containers 3 -timeout 2
{code}
{code}
root@node2:~# yarn container -list appattempt_1431531743796_0015_01
15/05/15 20:49:01 INFO client.RMProxy: Connecting to ResourceManager at 
node2/10.211.55.102:8032
15/05/15 20:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Total number of containers :5
  Container-IdStart Time Finish Time
   StateHost   Node Http Address
 LOG-URL
container_1431531743796_0015_01_05  Fri May 15 20:44:12 + 2015  
 N/A RUNNING node3:50093
http://node3:8042
http://node3:8042/node/containerlogs/container_1431531743796_0015_01_05/root
container_1431531743796_0015_01_01  Fri May 15 20:44:06 + 2015  
 N/A RUNNING node3:50093
http://node3:8042
http://node3:8042/node/containerlogs/container_1431531743796_0015_01_01/root
container_1431531743796_0015_01_02  Fri May 15 20:44:10 + 2015  
 N/A RUNNING node3:50093
http://node3:8042
http://node3:8042/node/containerlogs/container_1431531743796_0015_01_02/root
container_1431531743796_0015_01_04  Fri May 15 20:44:11 + 2015  
 N/A RUNNING node3:50093
http://node3:8042
http://node3:8042/node/containerlogs/container_1431531743796_0015_01_04/root
container_1431531743796_0015_01_03  Fri May 15 20:44:10 + 2015  
 N/A RUNNING node4:41128
http://node4:8042
http://node4:8042/node/containerlogs/container_1431531743796_0015_01_03/root
{code}

The *fundamental* problem here, I believe, is that the AMRMClient maintains an 
internal request table *remoteRequestsTable* that keeps track of *total* 
container requests (i.e., including container requests that have been 
satisfied, and that are not yet satisfied):

{code:title=AMRMClient.java|borderStyle=solid}
protected final 
  Map>>
remoteRequestsTable =
new TreeMap>>();
{code}

However, the corresponding table *requests* at the scheduler side (inside 
AppSchedulingInfo.java) keeps track of *outstanding* container requests (i.e., 
container requests that are not yet satisfied):

{code:title=AppSchedulingInfo.java|borderStyle=solid}
  final Map> requests =
new ConcurrentHashMap>();
{code}

Every time an allocation is successfully made, the decResourceRequest() or 
decrementOutstanding() call will update the *requests* table so that it only 
contains outstanding requests, but unfortunately, every time an 
ApplicationMaster heartbeat comes, the same *requests* table is updated by the 
updateResourceRequests() call with the tot

[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-05-08 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535164#comment-14535164
 ] 

Zhijie Shen commented on YARN-1902:
---

[~stau-tudelft], assigned this jira to you as you've contributed the patch. 
However, the patch doesn't apply any more. Moreover, would you please take a 
look at Bikas' latest comment? It seems that the API is not designed for 
sending the delta, after the container is allocated, you need to remove the 
request explicitly. Back to your example, your 1st container is allocated, you 
need to remove one request before next allocate call, to make sure you're going 
to ask for 11 more containers, but not 12.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>Assignee: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377886#comment-14377886
 ] 

Hadoop QA commented on YARN-1902:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
  against trunk revision 3ca5bd1.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7090//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-11-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228780#comment-14228780
 ] 

Hadoop QA commented on YARN-1902:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
  against trunk revision 1556f86.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5951//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-11-27 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228030#comment-14228030
 ] 

Bikas Saha commented on YARN-1902:
--

If all the requests are for the same location (say star) then the client needs 
to sends all outstanding requests to the RM. The AM-RM protocol is not a delta 
protocol. Sending deltas will not work because the RM expects all requests at a 
given location to be presented for every update. It simply sets that value from 
the request to its internal book-keeping objects (vs performing an increment 
for the "new" request).

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-11-14 Thread Sietse T. Au (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212100#comment-14212100
 ] 

Sietse T. Au commented on YARN-1902:


The scenarios are such that ContainerRequest is a separate object with each 
addContainerRequest call.

The point is that the user does not want its previous requests to be undone, 
but to do an additional request, while not yet having received all the 12 
containers. 

Let's say for Scenario 1 that z = 12
12 containers are requested, then asynchronously you start an allocated 
container, after this, 1 extra container is requested. However instead of 1 
extra container being requested, 13 containers will be requested. This - as far 
as I can see - does not necessarily involve the RM and could be solved at the 
client side's internal bookkeeping of requests, which is what the patch is 
trying to show.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-10-29 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188516#comment-14188516
 ] 

Bikas Saha commented on YARN-1902:
--

bq. Given a ContainerRequest x with Resource y, when addContainerRequest is 
called z times with x, allocate is called and at least one of the z allocated 
containers is started, then if another addContainerRequest call is done and 
subsequently an allocate call to the RM, (z+1) containers will be allocated, 
where 1 container is expected.

Firstly, I am not sure if the same ContainerRequest object can be passed 
multiple times in addContainerRequest. It should be different objects each time 
(even if they point to the same resource). This might have something to do with 
the internal book-keeping done for matching requests.

Secondly, after z requests are made and 1 allocation is received then z-1 
requests remain. If you are using AMRMClientImpl then its your (users) 
responsibility to call removeContainerRequest() for the request that was 
matched to this container. The AMRMClient does not know which of your z 
requests could be assigned to this container. So in the general case, it cannot 
automatically remove a request from the internal table because it does not know 
which request to remove. If the javadocs dont clarify these semantics then we 
can improve the javadocs.

Thirdly, the protocol between the AMRMClient and the RM has an inherent race. 
So if the client had earlier asked for z containers and in the next heartbeat 
reduces that to z-1, the RM may actually return z containers to it because it 
had already allocated them to this client before the client updated the RM with 
the new value.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-10-29 Thread Yogesh Sobale (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188190#comment-14188190
 ] 

Yogesh Sobale commented on YARN-1902:
-

Can someone please update ?

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-10-29 Thread Yogesh Sobale (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188188#comment-14188188
 ] 

Yogesh Sobale commented on YARN-1902:
-

Can some please update ?

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-09-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144653#comment-14144653
 ] 

Hadoop QA commented on YARN-1902:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
  against trunk revision 7aa667e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA
org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5083//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5083//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-09-23 Thread Yogesh Sobale (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144633#comment-14144633
 ] 

Yogesh Sobale commented on YARN-1902:
-

This Fix will be available in 2.5.0 onwards ?

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Sietse T. Au
>  Labels: client
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-11 Thread Sietse T. Au (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967251#comment-13967251
 ] 

Sietse T. Au commented on YARN-1902:


Is this the right way to approach this issue or should there be a more thorough 
analysis of the current implementation?

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961588#comment-13961588
 ] 

Hadoop QA commented on YARN-1902:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638931/YARN-1902.v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3520//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3520//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961555#comment-13961555
 ] 

Hadoop QA commented on YARN-1902:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638928/YARN-1902.v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3519//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3519//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch, YARN-1902.v2.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961449#comment-13961449
 ] 

Hadoop QA commented on YARN-1902:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12638692/YARN-1902.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3517//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3517//console

This message is automatically generated.

> Allocation of too many containers when a second request is done with the same 
> resource capability
> -
>
> Key: YARN-1902
> URL: https://issues.apache.org/jira/browse/YARN-1902
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Sietse T. Au
>  Labels: patch
> Attachments: YARN-1902.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is 
> called z times with x, allocate is called and at least one of the z allocated 
> containers is started, then if another addContainerRequest call is done and 
> subsequently an allocate call to the RM, (z+1) containers will be allocated, 
> where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that indeed a (z+1) 
> are requested in both scenarios, but that only in the second scenario, the 
> correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused 
> by the structure of the remoteRequestsTable. The consequence of Map ResourceRequestInfo> is that ResourceRequestInfo does not hold any 
> information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers 
> received.
> The solution implemented is to initialize a new ResourceRequest in 
> ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.



--
This message was sent by Atlassian JIRA
(v6.2#6252)