MENG DING commented on YARN-1197:
[~leftnoteasy] I totally agree that Yarn should not mess with Java Xmx. Sorry
for not being clear before.
While digging into the design details of this issue, I found (I think) a piece
that is missing from the original design doc, on which I hope to get some
insights/clarifications from the community:
there seems to be no discussion of the container resource increase token.
Here is what I think should happen:
1. AM sends a container resource increase request to RM.
2. RM grants the request, allocating the additional resource to the container,
updating its internal resource bookkeeping.
3. During the next AM-RM heartbeat, RM pulls the newly increased container,
creates a token for it, and sets the token in the allocation response. In the
meantime, RM starts a timer for this granted increase (e.g., registers it with
ContainerResourceIncreaseExpirer).
4. AM acquires the container resource increase token from the heartbeat
response, then calls the NMClient API to launch the container resource increase
action on NM.
5. NM receives the request, increases the monitoring quota of the container,
and notifies the NodeStatusUpdater.
6. The NodeStatusUpdater reports the increase success to RM during the regular
NM-RM heartbeat.
7. Upon receiving the increase success message, RM stops the timer (e.g.,
unregisters the container with ContainerResourceIncreaseExpirer).
If, however, the timer in RM expires and no increase success message is
received for this container, *RM must release the increased resource allocated
to the container, and update its internal resource bookkeeping*.
As such, NM-RM heartbeat must also include container resource increase message
(which doesn't exist in the original design), otherwise the expiration logic
will not work.
In addition, RM must remember the original resource allocation of the container
(this info may be stored in the ContainerResourceIncreaseExpirer), because in
the case of expiration, RM needs to release the increased resource and revert
to the original resource allocation. This is different from a newly allocated
container, in which case RM simply needs to release the resource for the entire
container when it expires.
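As a minimal sketch of this bookkeeping (all class and method names here are hypothetical illustrations, not from any actual patch), the expirer could remember each container's pre-increase allocation so the expiry handler knows what to revert to:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remembers the pre-increase allocation of each
// container whose granted increase has not yet been confirmed by NM,
// so that RM can revert its bookkeeping when the timer expires.
class ContainerResourceIncreaseExpirer {
    // containerId -> original memory (in MB) before the increase was granted
    private final Map<String, Integer> originalAllocation = new ConcurrentHashMap<>();

    // Step 3: RM grants the increase and starts tracking it.
    void register(String containerId, int originalMemMb) {
        originalAllocation.put(containerId, originalMemMb);
    }

    // Step 7: NM confirmed the increase via heartbeat; stop tracking.
    void unregister(String containerId) {
        originalAllocation.remove(containerId);
    }

    // Timer expiry: return the size to revert to, or null if the
    // increase was already confirmed (nothing to revert).
    Integer expire(String containerId) {
        return originalAllocation.remove(containerId);
    }
}
```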
To make matters more complicated, after a container resource increase token has
been given out, and before it expires, there is no guarantee that the AM won't
issue a resource *decrease* action on the same container. Because the resource
decrease action is initiated directly on NM, NM has no idea that a resource
increase token for the same container has been issued, and that a resource
increase action could happen at any time.
Given the above, here is what I propose to simplify things as much as we can
without compromising the main functionality:
*At the RM side*
1. During each scheduling cycle, if RM finds that there is still a granted
container resource increase sitting in RM (i.e., not yet acquired by the AM), it
will skip scheduling any outstanding resource increase request for the same
container.
2. During each scheduling cycle, if RM finds that there is a granted container
resource increase registered with ContainerResourceIncreaseExpirer, it will
skip scheduling any outstanding resource increase request for the same container.
This guarantees that, at any time, there is at most one outstanding resource
increase for a container.
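The two scheduling guards above could be sketched as follows (a hypothetical illustration; the class, method, and state names are mine):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "at most one outstanding increase per
// container" guard: rule 1 covers granted-but-not-yet-acquired
// increases, rule 2 covers acquired-but-not-yet-confirmed ones.
class IncreaseScheduleGuard {
    private final Set<String> grantedNotAcquired = new HashSet<>();   // rule 1
    private final Set<String> acquiredNotConfirmed = new HashSet<>(); // rule 2

    // True if the scheduler may grant a new increase for this container.
    boolean maySchedule(String containerId) {
        return !grantedNotAcquired.contains(containerId)
            && !acquiredNotConfirmed.contains(containerId);
    }

    void onGrant(String containerId) {
        grantedNotAcquired.add(containerId);
    }

    // AM pulled the token; the increase is now tracked by the expirer.
    void onAcquire(String containerId) {
        grantedNotAcquired.remove(containerId);
        acquiredNotConfirmed.add(containerId);
    }

    // Increase confirmed by NM, or the token expired.
    void onConfirmOrExpire(String containerId) {
        acquiredNotConfirmed.remove(containerId);
    }
}
```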
*At the NM side*
1. Create a map in NMContext to track any resource increase or decrease action
for a container. At any time, there can be only one increase or decrease action
in progress for a specific container. While an increase/decrease action is in
progress on NM, any new request from the AM to increase/decrease the resource
of the same container will be rejected (with a proper error message).
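The NM-side tracking map could be sketched like this (a hypothetical illustration; ConcurrentHashMap.putIfAbsent happens to give exactly the at-most-one-action-per-container semantics described above):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the NM-side tracking map: at any time at most
// one increase OR decrease action may be in progress per container;
// a concurrent request for the same container is rejected.
class ContainerResizeTracker {
    enum Action { INCREASE, DECREASE }

    private final ConcurrentHashMap<String, Action> inProgress = new ConcurrentHashMap<>();

    // True if the action was started; false means the AM's request
    // should be rejected because another resize is still in progress.
    boolean tryStart(String containerId, Action action) {
        return inProgress.putIfAbsent(containerId, action) == null;
    }

    // Called when the increase/decrease completes (success or failure).
    void finish(String containerId) {
        inProgress.remove(containerId);
    }
}
```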
With the above logic, here is an example of what could happen:
1. A container is currently using 6G
2. AM asks RM to increase it to 8G
3. RM grants the increase request, raises the container's allocation to 8G, and
issues a token to AM. It starts a timer and remembers the original resource
allocation before the increase (6G).
4. AM, instead of initiating the resource increase on NM, sends NM a resource
decrease request to shrink it to 4G
5. The decrease is successful and RM gets the notification, and updates the
container resource to 4G
After this, two possible sequences may occur:
Sequence A:
6. Before the token expires, AM sends the resource increase request to NM
7. The increase is successful, RM gets the notification, and updates the
container resource back to 8G
Sequence B:
6. AM never sends the resource increase request to NM
7. The token expires in RM. RM attempts to revert the resource increase (i.e.,
set the resource allocation back to 6G), but seeing that the container is
currently using 4G, it will do nothing.
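The expiry-time decision in step 7 of Sequence B boils down to a simple comparison, sketched below (a hypothetical helper, sizes in GB; the name is mine):

```java
// Hypothetical sketch of the expiry-time revert decision from the
// example above: only shrink back to the pre-increase allocation if
// the container is still above it; if an intervening decrease already
// brought it below the original size, leave the bookkeeping alone.
class ExpiryRevert {
    static int allocationAfterExpiry(int currentGb, int originalGb) {
        return currentGb > originalGb ? originalGb : currentGb;
    }
}
```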
This is what I have for now. Sorry for the long email, and I hope that I have
made myself clear. Please let me know if I am on the right track. Any
comments/corrections/suggestions/advice will be extremely appreciated.
> Support changing resources of an allocated container
> Key: YARN-1197
> URL: https://issues.apache.org/jira/browse/YARN-1197
> Project: Hadoop YARN
> Issue Type: Task
> Components: api, nodemanager, resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Wangda Tan
> Attachments: mapreduce-project.patch.ver.1,
> tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf,
> yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf,
> yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1,
> yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1,
> The current YARN resource management logic assumes the resource allocated to a
> container is fixed during its lifetime. When users want to change the resource
> of an allocated container, the only way is to release it and allocate a new
> container with the expected size.
> Allowing run-time changes to the resources of an allocated container will give
> us better control of resource usage on the application side.