MENG DING commented on YARN-1197:

[~leftnoteasy] I totally agree that Yarn should not mess with Java Xmx. Sorry 
for not being clear before.

While digging into the design details of this issue, there is (I think) a piece 
that is missing from the original design doc, which I hope to get some 
insights/clarifications from the community:

It seems there is no discussion about the container resource increase token 
expiration logic.

Here is what I think that should happen:

1. AM sends a container resource increase request to RM.
2. RM grants the request, allocating the additional resource to the container, 
updating its internal resource bookkeeping.
3. During the next AM-RM heartbeat, RM pulls the newly increased container, 
creates a token for it, and sets the token in the allocation response. In the 
meantime, RM starts a timer for this granted increase (e.g., registers it with 
ContainerResourceIncreaseExpirer).
4. AM acquires the container resource increase token from the heartbeat 
response, then calls the NMClient API to launch the container resource increase 
action on NM.
5. NM receives the request, increases the monitoring quota of the container, 
and notifies the NodeStatusUpdater.
6. The NodeStatusUpdater informs the increase success to RM during regular 
NM-RM heartbeat.
7. Upon receiving the increase success message, RM stops the timer (e.g., 
unregisters the container with ContainerResourceIncreaseExpirer).

If, however, the timer in RM expires and no increase success message has been 
received for this container, *RM must release the increased resource allocated 
to the container, and update its internal resource bookkeeping*.

As such, the NM-RM heartbeat must also include a container resource increase 
message (which doesn't exist in the original design); otherwise the expiration 
logic will not work.

In addition, RM must remember the original resource allocation of the container 
(this info may be stored in the ContainerResourceIncreaseExpirer), because in 
the case of expiration, RM needs to release the increased resource and revert 
to the original allocation. This is different from a newly allocated container, 
where RM simply releases the resource for the entire container when it expires.
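The bookkeeping described above can be sketched as follows. This is a minimal illustration, not existing YARN code: the class name ContainerResourceIncreaseExpirer comes from this discussion, but its method names, the String container ids, and the plain-MB resources are my assumptions, and a real expirer would also run the actual timer that triggers onExpiry().

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal sketch of the expirer bookkeeping (illustrative names/types).
class ContainerResourceIncreaseExpirer {
    // containerId -> allocation (in MB) the container had before the increase
    private final ConcurrentMap<String, Integer> originalAllocation =
        new ConcurrentHashMap<>();

    // Called when RM hands the increase token to AM.
    void register(String containerId, int originalMB) {
        originalAllocation.put(containerId, originalMB);
    }

    // Called when NM confirms the increase: nothing left to revert.
    void unregister(String containerId) {
        originalAllocation.remove(containerId);
    }

    // Called when the timer fires without confirmation; returns the size to
    // revert the container to, or null if already unregistered.
    Integer onExpiry(String containerId) {
        return originalAllocation.remove(containerId);
    }
}
```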

To make matters more complicated, after a container resource increase token has 
been given out, and before it expires, there is no guarantee that AM won't 
issue a resource *decrease* action on the same container. Because the resource 
decrease action starts from NM, NM has no idea that a resource increase token 
on the same container has been issued, and that a resource increase action 
could happen anytime.

Given the above, here is what I propose to simplify things as much as we can 
without compromising the main functionality:

*At the RM side* 
1. During each scheduling cycle, if RM finds a granted container resource 
increase still sitting in RM (i.e., not yet acquired by AM), it will skip 
scheduling any outstanding resource increase request for the same container.
2. During each scheduling cycle, if RM finds a granted container resource 
increase registered with ContainerResourceIncreaseExpirer, it will skip 
scheduling any outstanding resource increase request for the same container.

This guarantees that at any time, there is at most one outstanding resource 
increase for a given container.
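The two RM-side checks can be sketched like this. Again this is illustrative, not actual scheduler code; the class and method names and the String container ids are my assumptions:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the two RM-side checks: an increase is not scheduled
// for a container that already has one granted-but-not-yet-acquired, or one
// acquired but not yet confirmed by NM (i.e., registered with the expirer).
class IncreaseScheduleGuard {
    private final Set<String> grantedNotAcquired = new HashSet<>();
    private final Set<String> registeredWithExpirer = new HashSet<>();

    boolean canScheduleIncrease(String containerId) {
        return !grantedNotAcquired.contains(containerId)
            && !registeredWithExpirer.contains(containerId);
    }

    // RM grants the increase.
    void onGrant(String containerId) {
        grantedNotAcquired.add(containerId);
    }

    // AM acquires the token; the grant is now tracked by the expirer.
    void onAcquire(String containerId) {
        grantedNotAcquired.remove(containerId);
        registeredWithExpirer.add(containerId);
    }

    // NM confirmed the increase, or the token expired.
    void onConfirmOrExpire(String containerId) {
        registeredWithExpirer.remove(containerId);
    }
}
```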

*At the NM side*
1. Create a map in NMContext to track any in-progress resource increase or 
decrease action for a container. At any time, there can be either an increase 
action or a decrease action going on for a specific container, but not both. 
While an increase/decrease action is in progress on NM, any new request from AM 
to increase or decrease the resource of the same container will be rejected 
(with a proper error message).
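A sketch of that NM-side map (illustrative names; not existing NodeManager code). Using putIfAbsent makes admission atomic, so at most one increase or decrease action can be in flight per container:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the proposed NM-side tracking map (names are assumptions).
class ContainerResourceChangeTracker {
    enum Action { INCREASE, DECREASE }

    private final ConcurrentMap<String, Action> inFlight = new ConcurrentHashMap<>();

    // Returns true if the action was admitted; false means another
    // increase/decrease is already in progress and AM should get an error.
    boolean tryStart(String containerId, Action action) {
        return inFlight.putIfAbsent(containerId, action) == null;
    }

    // Called when the action completes, successfully or not.
    void finish(String containerId) {
        inFlight.remove(containerId);
    }
}
```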

With the above logic, here is an example of what could happen:

1. A container is currently using 6G.
2. AM asks RM to increase it to 8G.
3. RM grants the increase request, raises the container's allocation to 8G, and 
issues a token to AM. It starts a timer and remembers the original allocation 
before the increase, 6G.
4. AM, instead of initiating the resource increase on NM, requests a resource 
decrease from NM, down to 4G.
5. The decrease is successful; RM gets the notification and updates the 
container resource to 4G.

After this, two possible sequences may occur.

Sequence A:
6. Before the token expires, AM initiates the resource increase on NM.
7. The increase is successful; RM gets the notification and updates the 
container resource back to 8G.

Sequence B:
6. AM never sends the resource increase to NM.
7. The token expires in RM. RM attempts to revert the resource increase (i.e., 
set the allocation back to 6G), but seeing that the container is currently 
using only 4G, it does nothing.
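The expiry decision above reduces to a small check: if the container no longer sits at the granted (increased) size, an intervening decrease has already been applied and there is nothing to revert. A sketch of that rule, with illustrative names and MB sizes (not actual RM code):

```java
// Illustrative revert decision for an expired increase token: roll back to the
// pre-increase size only if the container still sits at the granted size; if a
// decrease landed in the meantime, keep the current smaller allocation.
class IncreaseExpiryPolicy {
    static int allocationAfterExpiry(int currentMB, int grantedMB, int originalMB) {
        if (currentMB == grantedMB) {
            return originalMB; // increase never enacted on NM: revert to original
        }
        return currentMB;      // container already changed (e.g., decreased): keep as is
    }
}
```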

This is what I have for now. Sorry for the long email, and I hope I have made 
myself clear. Please let me know if I am on the right track. Any 
comments/corrections/suggestions/advice would be greatly appreciated.


> Support changing resources of an allocated container
> ----------------------------------------------------
>                 Key: YARN-1197
>                 URL: https://issues.apache.org/jira/browse/YARN-1197
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: api, nodemanager, resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Wangda Tan
>         Attachments: mapreduce-project.patch.ver.1, 
> tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, 
> yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
> yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
> yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
> yarn-server-resourcemanager.patch.ver.1
> The current YARN resource management logic assumes the resource allocated to 
> a container is fixed during its lifetime. When users want to change the 
> resource of an allocated container, the only way is releasing it and 
> allocating a new container with the expected size.
> Allowing run-time changing of resources of an allocated container will give 
> us better control of resource usage on the application side.
