[
https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273990#comment-14273990
]
Peter D Kirchner commented on YARN-3020:
----------------------------------------
It looks like the bug may have come in with the code reorganization of r1494017
on 2013-06-18. I did not follow the log past this introduction of
AMRMClient.java in its present form and location.
In my code on my system (and, I suppose, on yours too) each
addContainerRequest() call takes about a second, even without a sleep. The
heartbeat I set in createAMRMClientAsync() was 1000 milliseconds (1 second), so
I raised it to 10 seconds to rule out addContainerRequest() being somehow
synchronous with allocate(). FWIW, with 10 containers requested and a
heartbeat of 10 seconds, I got 17 containers: one heartbeat call to allocate()
produced 7 containers and the next call produced 10. On each heartbeat where
the AMRMClient detects a change (in the number of containers the AM has
"add"ed) that needs to be sent to the RM, it sends the then-current total, not
the diff.
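The 7 + 10 = 17 observation can be reproduced with a small toy model. This is
only a sketch and not Hadoop code: the class and method names are made up for
illustration, and it assumes the RM grants whatever total each changed
heartbeat carries.

```java
// Toy model only; CumulativeAskSketch and containersGranted are hypothetical
// names, not part of the YARN client API.
class CumulativeAskSketch {

    // addsPerInterval[i] = number of addContainerRequest() calls that
    // completed during heartbeat interval i.
    static int containersGranted(int... addsPerInterval) {
        int runningTotal = 0; // total "add"ed so far by the AM
        int granted = 0;      // containers handed out under the toy assumption
        for (int adds : addsPerInterval) {
            if (adds == 0) {
                continue;     // no change detected, so nothing is sent
            }
            runningTotal += adds;
            granted += runningTotal; // the total is sent, not the diff
        }
        return granted;
    }

    public static void main(String[] args) {
        // 7 requests before the first heartbeat, 3 more before the second.
        System.out.println(containersGranted(7, 3));
        // One request per heartbeat, 10 heartbeats.
        System.out.println(containersGranted(1, 1, 1, 1, 1, 1, 1, 1, 1, 1));
    }
}
```

With 7 adds before the first heartbeat and 3 before the second, the model sends
7 and then 10. With one add per heartbeat it sends 1, 2, ..., 10.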
Limiting the AM to ~1 container request per second is impractical, so the bug
may even be helpful at first: the application does not have to wait about
2 minutes to assemble 100 containers; all it needs to do is call
addContainerRequest() about 15 times, taking about 15 seconds with a 1 second
heartbeat. Either the addContainerRequest() performance will need to be
improved, or the limitation of 1 container per addContainerRequest()
introduced in r1503960 on 2013-07-16 will need to be reversed.
But by the time one naively requests 100 containers and gets 5,050, the bug is
probably hurting application and cluster performance. Maybe a lot.
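The figures above follow from the n*(n+1)/2 count in the issue title. A short
check of the arithmetic, with hypothetical helper names and assuming one
heartbeat per addContainerRequest() call (the worst case):

```java
// Arithmetic check only; TriangularAsk and its methods are illustrative names.
class TriangularAsk {

    // Containers received after n same-priority calls, worst case.
    static long triangular(long n) {
        return n * (n + 1) / 2;
    }

    // Smallest number of calls whose triangular total reaches the target.
    static long callsNeeded(long target) {
        long n = 0;
        while (triangular(n) < target) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(triangular(100));  // 100 naive requests
        System.out.println(callsNeeded(100)); // calls needed to reach 100
    }
}
```

For n = 100 this gives 5,050, and 14 calls already yield 105 containers, in
line with the "about 15" estimate.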
> n similar addContainerRequest()s produce n*(n+1)/2 containers
> -------------------------------------------------------------
>
> Key: YARN-3020
> URL: https://issues.apache.org/jira/browse/YARN-3020
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
> Reporter: Peter D Kirchner
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> BUG: If the application master calls addContainerRequest() n times, but with
> the same priority, I get up to 1+2+3+...+n = n*(n+1)/2 containers. The most
> containers are requested when the interval between calls to
> addContainerRequest() exceeds the heartbeat interval of calls to allocate()
> (in AMRMClientImpl's run() method).
> If the application master calls addContainerRequest() n times, but with a
> unique priority each time, I get n containers (as I intended).
> Analysis:
> There is a logic problem in AMRMClientImpl.java.
> Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent
> calls to addContainerRequest(), addResourceRequest() finds the previous
> matching remoteRequest and increments the container count rather than
> starting anew, and does an addResourceRequestToAsk(), which defeats the
> ask.clear().
> From documentation and code comments, it was hard for me to discern the
> intended behavior of the API, but the inconsistency reported in this issue
> suggests one case or the other is implemented incorrectly.
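
The same-priority versus unique-priority behavior in the quoted description can
be illustrated with a toy model of the bookkeeping. This is a sketch under
stated assumptions, not the AMRMClientImpl source: ToyAmrmClient, its fields,
and simulate() are hypothetical, requests are keyed by priority alone, and the
RM is assumed to grant whatever count each ask carries.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; not the real AMRMClientImpl.
class ToyAmrmClient {
    // priority -> accumulated container count (stands in for the remote request table)
    private final Map<Integer, Integer> remoteRequests = new HashMap<>();
    // priorities whose request must be sent on the next allocate()
    private final Set<Integer> ask = new HashSet<>();

    void addContainerRequest(int priority) {
        // A matching earlier request is found and incremented, not started anew.
        remoteRequests.merge(priority, 1, Integer::sum);
        // The request is put back into the ask even though allocate() cleared it.
        ask.add(priority);
    }

    // Returns the number of containers asked for on this heartbeat.
    int allocate() {
        int asked = 0;
        for (int priority : ask) {
            asked += remoteRequests.get(priority);
        }
        ask.clear();
        return asked;
    }

    // n calls with a heartbeat after each one (the worst case).
    static int simulate(int n, boolean samePriority) {
        ToyAmrmClient client = new ToyAmrmClient();
        int total = 0;
        for (int i = 0; i < n; i++) {
            client.addContainerRequest(samePriority ? 0 : i);
            total += client.allocate();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, true));  // same priority: 1+2+3+4
        System.out.println(simulate(4, false)); // unique priorities: 1+1+1+1
    }
}
```

In this model the count for a shared priority keeps growing across heartbeats,
while each unique priority is only ever sent with a count of 1.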