Peter D Kirchner commented on YARN-3020:

It looks like the bug may have come in with the code reorganization of r1494017 
on 2013-06-18.  I did not follow the log past this introduction of 
AMRMClient.java in its present form and location.

In my code on my system (and I am supposing also in yours) each 
addContainerRequest() is taking about a second even without a sleep.  The 
heartbeat I set in createAMRMClientAsync() was 1000 milliseconds (1 second), so 
I set it to 10 seconds to rule out that the addContainerRequest() was somehow 
synchronous with allocate().  FWIW, for 10 containers requested, I got 17 
containers with a heartbeat of 10 seconds.  One heartbeat call to allocate() 
produced 7 containers, the next call produced 10.  On each heartbeat where the 
AMRMClient detects a change (in the number of containers the AM has "add"ed) 
that needs to be sent to the RM, it sends the then-current total, not the diff.
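
To make the 7 + 10 = 17 observation concrete, here is a minimal sketch of the 
timing.  It is a toy model, not Hadoop code: the class and method names are 
hypothetical, and it assumes the RM grants every count it is sent.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical names, not part of Hadoop): on each heartbeat where
// the number of added requests has changed, the client sends the current
// total rather than the difference since the previous heartbeat.
class HeartbeatModel {

    // addTimesMs: sorted times at which each addContainerRequest() completed.
    // Returns the count sent on every heartbeat that detected a change.
    static List<Integer> asksSent(long[] addTimesMs, long firstBeatMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        List<Integer> sent = new ArrayList<>();
        int added = 0;     // requests added so far
        int lastSeen = 0;  // total seen at the previous heartbeat
        for (long t = firstBeatMs; ; t += intervalMs) {
            while (added < addTimesMs.length && addTimesMs[added] <= t) {
                added++;
            }
            if (added != lastSeen) {
                sent.add(added);   // the then-current total, not the diff
                lastSeen = added;
            }
            if (added == addTimesMs.length) {
                break;             // nothing further can change
            }
        }
        return sent;
    }

    // Containers granted, assuming the RM honors every count it receives.
    static int total(List<Integer> sent) {
        int sum = 0;
        for (int s : sent) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] adds = new long[10];
        for (int k = 0; k < adds.length; k++) {
            adds[k] = (k + 1) * 1000L;  // one add per second
        }
        // 10 second heartbeat, first beat assumed to land after the 7th add
        System.out.println(asksSent(adds, 7500, 10000));       // [7, 10]
        System.out.println(total(asksSent(adds, 7500, 10000))); // 17
        // 1 second heartbeat: every beat sees one more request
        System.out.println(total(asksSent(adds, 1500, 1000)));  // 55
    }
}
```

The phase of the first heartbeat (7500 ms here) is an assumption chosen to 
reproduce the 7-then-10 split; a different phase gives a different split.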

Limiting the AM to ~1 container request per second is impractical, so the bug 
may even be helpful at first: the application does not have to wait 
2 minutes to assemble 100 containers; all it needs to do is call 
addContainerRequest() about 15 times, taking about 15 seconds with a 1 second 
heartbeat.  Either the addContainerRequest() performance will need to be 
improved, or the limitation of 1 container per addContainerRequest() 
introduced in r1503960 on 2013-07-16 will need to be reversed.

But by the time one naively requests 100 containers and gets 5,050, the bug is 
probably hurting application and cluster performance.  Maybe a lot.
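
The figures above follow from the n*(n+1)/2 count in the issue title.  A short 
sketch of the arithmetic (class and method names are hypothetical):

```java
// Arithmetic behind the figures quoted in this comment.
class TriangularCount {

    // Worst-case containers after n same-priority addContainerRequest() calls,
    // per the n*(n+1)/2 formula in the issue title.
    static long containersFor(int n) {
        return (long) n * (n + 1) / 2;
    }

    // Smallest number of calls whose worst-case total reaches the target.
    static int callsNeeded(long target) {
        int n = 0;
        while (containersFor(n) < target) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(containersFor(100)); // 5050
        System.out.println(containersFor(15));  // 120
        System.out.println(callsNeeded(100));   // 14
    }
}
```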

> n similar addContainerRequest()s produce n*(n+1)/2 containers
> -------------------------------------------------------------
>                 Key: YARN-3020
>                 URL: https://issues.apache.org/jira/browse/YARN-3020
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>            Reporter: Peter D Kirchner
>   Original Estimate: 24h
>  Remaining Estimate: 24h
> BUG: If the application master calls addContainerRequest() n times, but with 
> the same priority, I get up to 1+2+3+...+n = n*(n+1)/2 containers.  The most 
> containers are requested when the interval between calls to 
> addContainerRequest() exceeds the heartbeat interval of calls to allocate() 
> (in AMRMClientImpl's run() method).
> If the application master calls addContainerRequest() n times, but with a 
> unique priority each time, I get n containers (as I intended).
> Analysis:
> There is a logic problem in AMRMClientImpl.java.
> Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent 
> calls to addContainerRequest(), addResourceRequest() finds the previous 
> matching remoteRequest and increments its container count rather than 
> starting anew, and then does an addResourceRequestToAsk(), which defeats the 
> ask.clear().
> From documentation and code comments, it was hard for me to discern the 
> intended behavior of the API, but the inconsistency reported in this issue 
> suggests one case or the other is implemented incorrectly.
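
The analysis quoted above can be modeled in a few lines.  This is only a 
sketch of the described logic, not the actual AMRMClientImpl code: the class 
AskModel, its two maps, and the rule that the RM grants every count in each 
ask are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the behavior described in the issue (hypothetical, simplified).
class AskModel {
    // priority -> cumulative container count; survives across allocate() calls
    private final Map<Integer, Integer> remoteRequests = new HashMap<>();
    // entries to send on the next heartbeat; cleared by allocate()
    private final Map<Integer, Integer> ask = new HashMap<>();
    private int grantedTotal = 0;

    void addContainerRequest(int priority) {
        // Finds the previous matching entry and increments it instead of
        // starting anew, then puts the cumulative count back into the ask.
        int count = remoteRequests.merge(priority, 1, Integer::sum);
        ask.put(priority, count);
    }

    void allocate() {
        // Assumption: the RM grants every count it receives in this ask.
        for (int count : ask.values()) {
            grantedTotal += count;
        }
        ask.clear();  // clears the ask, but remoteRequests keeps its counts
    }

    int granted() {
        return grantedTotal;
    }

    public static void main(String[] args) {
        AskModel same = new AskModel();
        AskModel unique = new AskModel();
        for (int k = 0; k < 5; k++) {
            same.addContainerRequest(1);    // same priority each time
            same.allocate();                // heartbeat between adds
            unique.addContainerRequest(k);  // unique priority each time
            unique.allocate();
        }
        System.out.println(same.granted());   // 15 = 5*6/2
        System.out.println(unique.granted()); // 5
    }
}
```

In this model, n same-priority adds that all land before a single heartbeat 
yield n containers, which is consistent with the "up to" in the report.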

This message was sent by Atlassian JIRA
