MENG DING commented on YARN-1197:

Thanks [~vinodkv] and [~leftnoteasy] for the great comments!

*To [~vinodkv]:*

bq.  Expanding containers at ACQUIRED state sounds useful in theory. But agree 
with you that we can punt it for later.
Thanks for the confirmation :-)

bq. To your example of concurrent increase/decrease sizing requests from AM, 
shall we simply say that only one change-in-progress is allowed for any given 
container?
Actually, we really wanted to achieve this, but with the current asymmetric 
logic of increasing resource through the RM and decreasing resource through the 
NM, it doesn't seem to be possible :-( The reason is:
* The increase action starts with the AM requesting an increase from the RM and 
being granted a resource increase token, then initiating the increase on the 
NM, and ends with the NM confirming the increase with the RM.
* Once an increase token has been granted to the AM, and before it expires (10 
minutes by default), if the AM does not initiate the increase action on the NM, 
*the NM has no idea that an increase is already in progress*.
* If, at this moment, the AM initiates a resource decrease action on the NM, 
the NM will go ahead and honor it. So in effect, concurrent decrease/increase 
actions can happen, and there doesn't seem to be a way to block this.

bq. If we do the above, this will also simplify most of the code, as we will 
simply have the notion of a Change, instead of an explicit increase/decrease 
everywhere. For e.g., we will just have a ContainerResourceChangeExpirer.
I believe the ContainerResourceChangeExpirer only applies to the container 
resource increase action. The container decrease action goes directly through 
the NM, so it does not need expiration logic.

bq. There will be races with container-states toggling from RUNNING to finished 
states, depending on when AM requests a size-change and when NMs report that a 
container finished. We can simply say that the state at the ResourceManager 

bq. Didn't understand why we need this RM-NM confirmation. The token from RM to 
AM to NM should be enough for NM to update its view, right?
This is the same as the reasons listed above.

bq. Instead of adding new records for ContainerResourceIncrease / decrease in 
AllocationResponse, should we add a new field in the API record itself stating 
if it is a New/Increased/Decreased container? If we move to a single change 
model, it's likely we will not even need this.

I am open to this suggestion. We could add a field to the existing 
*ContainerProto* to indicate whether this Container is a new/increased/decreased 
container. The only thing I am not sure about is whether we can still change 
AllocateResponseProto now that ContainerResourceIncrease/Decrease is already in 
trunk.
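As a rough sketch of what that suggestion might look like (names are illustrative, not the actual YARN API): the Container record itself carries an update type, so AllocateResponse can return a single list of containers and the AM dispatches on the type:

```java
public class ContainerUpdateSketch {
    // Hypothetical field distinguishing how the RM is returning this container.
    enum ContainerUpdateType { NEW, INCREASED, DECREASED }

    static class Container {
        final String id;
        final int memoryMb;
        final ContainerUpdateType updateType;

        Container(String id, int memoryMb, ContainerUpdateType updateType) {
            this.id = id;
            this.memoryMb = memoryMb;
            this.updateType = updateType;
        }
    }

    public static void main(String[] args) {
        // One list of containers in the response; no separate
        // ContainerResourceIncrease/Decrease records needed.
        Container c = new Container("container_01", 4096, ContainerUpdateType.INCREASED);
        System.out.println(c.id + " -> " + c.updateType);
    }
}
```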

bq. Any obviously invalid change-requests should be rejected right-away. For 
e.g, an increase to more than cluster's max container size. Seemed like you are 
suggesting we ignore the invalid requests.

Agreed that any invalid increase requests from the AM to the RM, and any 
invalid decrease requests from the AM to the NM, should be rejected directly. 
The 'ignore' case I was referring to is in the context of NodeUpdate from the 
NM to the RM.
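A minimal sketch of that right-away rejection, assuming a hypothetical cluster maximum (none of these names are the actual YARN validation code):

```java
public class ChangeRequestValidation {
    // Hypothetical cluster-wide maximum container size.
    static final int CLUSTER_MAX_MB = 8192;

    // Reject obviously invalid increase requests immediately, instead of
    // letting them reach the scheduler.
    static boolean validateIncrease(int currentMb, int targetMb) {
        return targetMb > currentMb && targetMb <= CLUSTER_MAX_MB;
    }

    public static void main(String[] args) {
        System.out.println(validateIncrease(2048, 4096));   // valid increase
        System.out.println(validateIncrease(2048, 16384));  // above cluster max
        System.out.println(validateIncrease(4096, 2048));   // not an increase
    }
}
```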

bq. Nit: In the design doc, the high-level flow for container-increase point #7 
incorrectly talks about decrease instead of increase.

Yes, this is a mistake, and I will correct it.

bq. I propose we do this in a branch

Definitely. There is already a YARN-1197 branch, and we can simply work in that 
branch.

*To [~leftnoteasy]:*

bq. Actually the approach in the design doc is this (Meng plz let me know if I 
misunderstood). In the scheduler's implementation, it allows only one pending 
change request for the same container; a later change request will either 
overwrite the prior one or be rejected.
The current design allows only one increase request in the whole system at a 
time, which is guaranteed by the ContainerResourceIncreaseExpirer object. 
However, as explained above, we cannot block a decrease action while an 
increase action is still in progress.

bq. 1) For the protocols between servers/AMs, mostly same to previous doc, the 
biggest change I can see is the ContainerResourceChangeProto in 
NodeHeartbeatResponseProto, which makes sense to me.

Yes, the ContainerResourceChangeProto is the biggest change. Glad that you 
agree with this new protocol :-)

bq. 2) For the client side change: 2.2.1, +1 to option 3.

Great. I will remove option 1 and option 2 from the design doc.

bq. 3) For scheduling part, {{The scheduling of an outstanding resource 
increase request to a container will be skipped if there are either:}}. Both of 
the two may not be needed, since the AM can ask for more resource while a 
container increase is in progress (e.g. container increased to 4G, and AM wants 
it to be 6G before notifying NM).

Good point, this could be very convenient in practice. However, the thing I 
have not figured out is how to handle the increase token expiration logic if we 
have multiple increase actions going on at the same time. The current 
expiration logic (section 2.3.2 in the design doc) only tracks one increase 
request per container (container ID + original capacity for rollback). As an 
example, if the AM is currently using 2G, asks to increase to 4G, and then asks 
again to increase to 6G, but never actually uses either token to increase the 
resource on the NM, then the RM can only revert the resource allocation back to 
4G after expiration, not to 2G.
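A worked sketch of this rollback problem (hypothetical names, mirroring the container ID + original capacity bookkeeping, and assuming a second grant simply overwrites the first): stacking a second grant loses the true pre-increase capacity, so expiry can only restore 4G, not 2G.

```java
import java.util.HashMap;
import java.util.Map;

public class IncreaseExpirerSketch {
    // containerId -> capacity to roll back to on token expiry
    static Map<String, Integer> rollbackCapacity = new HashMap<>();
    // containerId -> capacity the RM currently has allocated
    static Map<String, Integer> allocatedCapacity = new HashMap<>();

    static void grantIncrease(String id, int targetMb) {
        // Record the pre-increase capacity as the rollback point,
        // overwriting any earlier rollback point for this container.
        rollbackCapacity.put(id, allocatedCapacity.get(id));
        allocatedCapacity.put(id, targetMb);
    }

    static void expire(String id) {
        // Token never used on the NM: revert to the tracked rollback point.
        allocatedCapacity.put(id, rollbackCapacity.remove(id));
    }

    public static void main(String[] args) {
        allocatedCapacity.put("container_01", 2048); // AM is using 2G
        grantIncrease("container_01", 4096);         // 2G -> 4G granted
        grantIncrease("container_01", 6144);         // 4G -> 6G granted
        expire("container_01");                      // AM used neither token
        // Lands at 4096, although the container is really still at 2048.
        System.out.println("after expiry: " + allocatedCapacity.get("container_01"));
    }
}
```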

bq. 4) We may not need "reserved increase request"; all increase requests 
should be considered "reserved". But we still need to respect the order of 
applications in the LeafQueue, whether it's the original FIFO or Fair (added 
after YARN-3306). We can discuss more scheduling details in a separate JIRA.
For sure. My knowledge in the scheduler side is still very limited, so I will 
continue to learn along the way.

By the way, thanks for clearing up the JIRAs. It's great that you are able to 
work on the RM/Scheduler! I am glad to take any unassigned tasks :-)

> Support changing resources of an allocated container
> ----------------------------------------------------
>                 Key: YARN-1197
>                 URL: https://issues.apache.org/jira/browse/YARN-1197
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: api, nodemanager, resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Wangda Tan
>         Attachments: YARN-1197_Design.pdf, mapreduce-project.patch.ver.1, 
> tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, 
> yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, 
> yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, 
> yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, 
> yarn-server-resourcemanager.patch.ver.1
> The current YARN resource management logic assumes the resource allocated to 
> a container is fixed during its lifetime. When users want to change the 
> resource of an allocated container, the only way is to release it and 
> allocate a new container with the expected size.
> Allowing run-time changes to the resources of an allocated container will 
> give applications better control of resource usage.

This message was sent by Atlassian JIRA
