[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138.7.patch

Fix the failed unit test case {{testDecreaseAfterIncreaseWithAllocationExpiration}}. Attaching the latest patch.

> Roll back container resource allocation after resource increase token expires
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4138
>                 URL: https://issues.apache.org/jira/browse/YARN-4138
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, nodemanager, resourcemanager
>            Reporter: MENG DING
>            Assignee: MENG DING
>         Attachments: YARN-4138-YARN-1197.1.patch, YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch, YARN-4138.4.patch, YARN-4138.5.patch, YARN-4138.6.patch, YARN-4138.7.patch
>
> In YARN-1651, after container resource increase token expires, the running container is killed.
> This ticket will change the behavior such that when a container resource increase token expires, the resource allocation of the container will be reverted back to the value before the increase.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138.6.patch

Attaching a new patch that updates {{lastConfirmedResource}} based on the increased containers reported by the NM. Since the {{containerIncreasedOnNode}} function updates {{lastConfirmedResource}}, we guard its body with the queue lock, but drop the cs lock (this is consistent with other functions like {{rollbackContainerResource}}, {{updateIncreaseRequests}} and {{decreaseContainer}}).
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138.5.patch

Hi, [~jianhe]

bq. After step 6, rmContainer.getLastConfirmedResource() will return 3G, when the expire event gets triggered, won't it reset it back to 3G?

No, it won't reset it back to 3G. rmContainer.getLastConfirmedResource() will not return 3G after step 6; it is still 1G. We only confirm a resource when the NM reported resource is the same as the RM allocated resource. In this test case, the NM reported resource is 3G, but the RM allocated resource is 6G, so 3G is NOT confirmed. This issue was discussed in this thread a while ago:
https://issues.apache.org/jira/browse/YARN-4138?focusedCommentId=14737229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14737229

bq. I think RMContainerImpl will not receive EXPIRE event at RUNNING state after this patch ? if so, we can remove this.

You are right, we can remove this. Attaching the latest patch that removes this.
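
The confirmation rule described in this comment can be illustrated with a minimal sketch. This is not the patch's code: the class {{ConfirmedResourceSketch}} and its method names are hypothetical, resources are reduced to a single memory value in MB, and only the numbers quoted above (1G confirmed, 6G allocated by RM, 3G reported by NM) are taken from the thread. The earlier steps of the test case are not shown here, so the sketch starts from that state.

```java
/**
 * Hypothetical, simplified model of the rule "only confirm a resource when
 * the NM reported resource equals the RM allocated resource".
 * Not the actual RMContainerImpl code.
 */
final class ConfirmedResourceSketch {
    private long allocatedMb;      // what RM has currently allocated
    private long lastConfirmedMb;  // what RM would roll back to on expiry

    ConfirmedResourceSketch(long initialMb) {
        this.allocatedMb = initialMb;
        this.lastConfirmedMb = initialMb;
    }

    /** RM approves an increase; nothing is confirmed yet. */
    void approveIncrease(long targetMb) {
        allocatedMb = targetMb;
    }

    /** NM reports the container size; confirm only on an exact match. */
    void onNodeReport(long reportedMb) {
        if (reportedMb == allocatedMb) {
            lastConfirmedMb = reportedMb;
        }
    }

    /** Increase token expired: revert the allocation to the confirmed value. */
    void onIncreaseExpired() {
        allocatedMb = lastConfirmedMb;
    }

    long getAllocatedMb() { return allocatedMb; }
    long getLastConfirmedMb() { return lastConfirmedMb; }

    public static void main(String[] args) {
        ConfirmedResourceSketch container = new ConfirmedResourceSketch(1024);
        container.approveIncrease(6144);  // RM allocated resource is 6G
        container.onNodeReport(3072);     // NM reports 3G: no match, not confirmed
        container.onIncreaseExpired();    // roll back to the last confirmed value
        System.out.println("allocated after expiry (MB): " + container.getAllocatedMb());
    }
}
```

In this model the 3G report leaves {{lastConfirmedMb}} untouched, so the expiry rolls the allocation back to 1G rather than 3G, which matches the answer given to [~jianhe].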
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138.4.patch

Attaching a new patch now that YARN-4519 is completed. In the {{rollbackContainerResource}} function, we grab the queue lock first, calculate the delta resource, and then call {{decreaseContainer}}. There is no need to grab the cs lock.
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138.3.patch

Attaching the latest patch that addresses [~jianhe]'s and [~sandflee]'s comments.

I think the issue brought up by [~jianhe] is about race conditions between a normal resource decrease and a resource rollback. The proposed fix is to guard the resource rollback with the same sequence of locks as the normal resource decrease, i.e., lock on the application first, then on the scheduler.

So with the proposed fix, we can walk through the original example:
1. AM asks to increase 2G -> 8G, and the request is approved by RM.
2. AM does not increase the container. AM asks to decrease to 1G, and at the same time the increase expiration logic is triggered:
* If the normal decrease is processed first: RM decreases 8G -> 1G (allocated and lastConfirmed are now set to 1G), and then the rollback is processed: RM rolls back 1G -> 1G (skipped).
* If the rollback is processed first: RM rolls back 8G -> 2G (allocated and lastConfirmed are now set to 2G), and then the normal decrease is processed: RM decreases 2G -> 1G.
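
The walk-through above can be sketched as a small model in which the decrease and the rollback take the same two locks in the same order. Everything here is an assumption made for illustration: the class {{RollbackRaceSketch}}, its fields and its two {{ReentrantLock}} instances are hypothetical stand-ins for the application and scheduler locks, and sizes are plain GB values rather than YARN {{Resource}} objects.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical model of the decrease/rollback race; not the scheduler's code. */
final class RollbackRaceSketch {
    private final ReentrantLock appLock = new ReentrantLock();        // stand-in: application lock
    private final ReentrantLock schedulerLock = new ReentrantLock();  // stand-in: scheduler lock
    private long allocatedGb;
    private long lastConfirmedGb;

    RollbackRaceSketch(long runningGb, long approvedIncreaseGb) {
        this.lastConfirmedGb = runningGb;       // e.g. 2G still in use
        this.allocatedGb = approvedIncreaseGb;  // e.g. 8G approved by RM
    }

    /** Runs the body under both locks, always application first, then scheduler. */
    private <T> T locked(Supplier<T> body) {
        appLock.lock();
        try {
            schedulerLock.lock();
            try {
                return body.get();
            } finally {
                schedulerLock.unlock();
            }
        } finally {
            appLock.unlock();
        }
    }

    /** Normal decrease requested by the AM: sets both values to the target. */
    void decrease(long targetGb) {
        locked(() -> {
            allocatedGb = targetGb;
            lastConfirmedGb = targetGb;
            return null;
        });
    }

    /** Rollback on token expiry; returns false when there is nothing to roll back. */
    boolean rollback() {
        return locked(() -> {
            if (allocatedGb == lastConfirmedGb) {
                return false;  // e.g. 1G -> 1G, skipped
            }
            allocatedGb = lastConfirmedGb;
            return true;
        });
    }

    long getAllocatedGb() {
        return locked(() -> allocatedGb);
    }

    public static void main(String[] args) {
        RollbackRaceSketch decreaseFirst = new RollbackRaceSketch(2, 8);
        decreaseFirst.decrease(1);
        decreaseFirst.rollback();
        RollbackRaceSketch rollbackFirst = new RollbackRaceSketch(2, 8);
        rollbackFirst.rollback();
        rollbackFirst.decrease(1);
        System.out.println(decreaseFirst.getAllocatedGb() + " " + rollbackFirst.getAllocatedGb());
    }
}
```

Because both operations serialize on the same lock sequence, either ordering leaves the container at 1G, and the rollback becomes a no-op when the decrease wins.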
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138-YARN-1197.2.patch

Submitting an updated patch that includes extensive test cases.
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MENG DING updated YARN-4138:
    Attachment: YARN-4138-YARN-1197.1.patch

Thank you very much [~sunilg].

Attaching a WIP patch. Still need to add test cases.

Some considerations:
* I think there is no need to create a new {{ContainerResourceIncreaseAllocationExpirer}}. We can just reuse the existing {{ContainerAllocationExpirer}} and change the type parameter. See below.
* Propose a new type parameter {{AllocationExpirationInfo}}, which wraps the containerId and a boolean value to indicate whether this is for increase expiration.
* Modify {{ContainerExpiredSchedulerEvent}} to add a boolean field to indicate whether this event is for increase expiration.
* Add RMContainerImpl.lastConfirmedResource to track the resource to roll back to when the increase token expires.
* When the scheduler receives the CONTAINER_EXPIRED event for a container resource increase, it calls the existing {{decreaseContainer}} to roll back resources.
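
A minimal sketch of the proposed {{AllocationExpirationInfo}} type parameter, under stated assumptions: a {{String}} stands in for YARN's {{ContainerId}}, the accessor names and the {{ExpirationDispatchSketch}} helper are hypothetical, and the WIP patch may define the class differently. Value-based {{equals}} and {{hashCode}} are an assumption as well, chosen here so that an initial-allocation entry and an increase entry for the same container stay distinct if the expirer keys its tracked items by value.

```java
import java.util.Objects;

/** Hypothetical stand-in for the proposed expirer key; a String replaces ContainerId. */
final class AllocationExpirationInfo {
    private final String containerId;
    private final boolean increase;  // true if this entry tracks an increase token

    AllocationExpirationInfo(String containerId, boolean increase) {
        this.containerId = Objects.requireNonNull(containerId, "containerId");
        this.increase = increase;
    }

    String getContainerId() { return containerId; }
    boolean isIncrease() { return increase; }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof AllocationExpirationInfo)) return false;
        AllocationExpirationInfo that = (AllocationExpirationInfo) other;
        return increase == that.increase && containerId.equals(that.containerId);
    }

    @Override
    public int hashCode() { return Objects.hash(containerId, increase); }
}

/** Hypothetical dispatch on the boolean carried by the expiration event. */
final class ExpirationDispatchSketch {
    enum Action { ROLLBACK_RESOURCE, EXISTING_EXPIRY_HANDLING }

    static Action actionFor(AllocationExpirationInfo info) {
        // Increase expiry rolls resources back; anything else keeps today's behavior.
        return info.isIncrease() ? Action.ROLLBACK_RESOURCE : Action.EXISTING_EXPIRY_HANDLING;
    }

    public static void main(String[] args) {
        System.out.println(actionFor(new AllocationExpirationInfo("container_1", true)));
        System.out.println(actionFor(new AllocationExpirationInfo("container_1", false)));
    }
}
```

The same single expirer can then carry both kinds of entries, which is the reason given above for not adding a second expirer class.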
[jira] [Updated] (YARN-4138) Roll back container resource allocation after resource increase token expires
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil G updated YARN-4138:
    Assignee: MENG DING  (was: Sunil G)