[jira] [Comment Edited] (YARN-7086) Release all containers aynchronously

2018-10-16 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065
 ] 

Manikandan R edited comment on YARN-7086 at 10/16/18 4:57 PM:
--

 
 [~jlowe] Reduced I/O by removing unnecessary stdout printing and lowering the 
log level. With these changes, I ran the test cases again, and the measurements 
(in ms) do not differ drastically between runs of the same case. In 
addition to the three cases, since the original intent of this Jira is to release 
containers asynchronously to avoid potential deadlocks, I added a 4th case that 
releases each container asynchronously, one container at a time, just 
to understand the difference between multiple container list traversal and 
handling a single container separately. Based on the results below, the 2nd case 
(async release + multiple container list traversal) not only reduces performance but 
also increases the complexity of the code. With the 4th case, the code changes are 
simple and clean. Though the 4th case takes longer than the 1st and 3rd cases, can we 
pick the 4th case given that we want to release containers asynchronously? Thoughts? 

 
||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430|444|1067|
|2|490|1604|453|1401|
|3|427|1133|438|972|
|4|482|1342|429|1228|
|5|459|1106|412|1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|
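
To make the comparison in the table easier to check, here is a minimal sketch that recomputes the averages and expresses each patched case relative to the existing code. The timings are copied from the table above; the class name, field names, and helper methods are illustrative assumptions and are not part of any attached patch.

```java
import java.util.Arrays;

// Illustrative only: recomputes the "Average of 5 runs" row from the table.
final class ReleaseTimings {
    // Timings in ms, copied from the table (runs 1 to 5).
    static final int[] EXISTING = {496, 490, 427, 482, 459};
    static final int[] ASYNC_LIST = {1430, 1604, 1133, 1342, 1106};
    static final int[] SYNC_LIST = {444, 453, 438, 429, 412};
    static final int[] ASYNC_SINGLE = {1067, 1401, 972, 1228, 1176};

    // Arithmetic mean of one column of the table.
    static double average(int[] runsMs) {
        return Arrays.stream(runsMs).average().orElse(Double.NaN);
    }

    // How many times slower (or faster, if below 1.0) a case is than the baseline.
    static double ratio(int[] candidate, int[] baseline) {
        return average(candidate) / average(baseline);
    }

    public static void main(String[] args) {
        System.out.printf("existing code:          %.1f ms%n", average(EXISTING));
        System.out.printf("async + list traversal: %.1f ms (%.2fx)%n",
                average(ASYNC_LIST), ratio(ASYNC_LIST, EXISTING));
        System.out.printf("sync + list traversal:  %.1f ms (%.2fx)%n",
                average(SYNC_LIST), ratio(SYNC_LIST, EXISTING));
        System.out.printf("async per container:    %.1f ms (%.2fx)%n",
                average(ASYNC_SINGLE), ratio(ASYNC_SINGLE, EXISTING));
    }
}
```

On these numbers, the two async cases take roughly 2.8 and 2.5 times as long as the existing code, while the non-async list traversal is about 8% faster.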

 



> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch, 
> YARN-7086.Perf-test-case.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> bring scheduling of new containers to a halt, especially for 
> applications that have a lot of live containers:
> # When these applications release their containers in bulk.
> # When these applications terminate abruptly due to some failure and the 
> scheduler releases all their live containers in a loop.
> To handle the issues mentioned above, we have a patch in production that makes 
> sure ALL container releases happen asynchronously, and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea in general (cc 
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases temp containers that it creates for the update).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7086) Release all containers aynchronously

2018-08-23 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590507#comment-16590507
 ] 

Manikandan R edited comment on YARN-7086 at 8/23/18 4:49 PM:
-

Thanks [~asuresh]

Attached .001 patch for early review. It has changes as described in 
https://issues.apache.org/jira/browse/YARN-7086?focusedCommentId=16140295=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16140295.

[~jlowe]

{quote}I think it would be a lot better if there was a bulk-release interface 
so we could grab the critical lock once.{quote}

I assume you are referring to the lock inside LeafQueue#completedContainer(). If 
so, one approach would be to change 
Scheduler#completedContainer(), Scheduler#completedContainerInternal() and 
LeafQueue#completedContainer() to accept a list of containers and process 
it, as opposed to accepting a single container. Currently, all these 
methods accept a single RMContainer and operate on it. 
With this new approach, we will need to work out how to accept a list and 
traverse it. Can you please confirm this?
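
As a rough illustration of the difference being discussed, the sketch below contrasts a per-container method, which takes the lock once per call, with a bulk method that takes the lock once and then traverses the list. `SketchQueue`, its methods, and the use of plain string IDs are hypothetical stand-ins; this is not the real LeafQueue or RMContainer API, only the locking shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a queue guarded by a single lock.
final class SketchQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> completed = new ArrayList<>();
    private int lockAcquisitions = 0;

    // Per-container shape: every call acquires and releases the lock.
    void completedContainer(String containerId) {
        lock.lock();
        try {
            lockAcquisitions++;
            completed.add(containerId);
        } finally {
            lock.unlock();
        }
    }

    // Bulk shape: acquire the lock once, then traverse the whole list.
    void completedContainers(List<String> containerIds) {
        lock.lock();
        try {
            lockAcquisitions++;
            for (String containerId : containerIds) {
                completed.add(containerId);
            }
        } finally {
            lock.unlock();
        }
    }

    int lockAcquisitions() {
        lock.lock();
        try {
            return lockAcquisitions;
        } finally {
            lock.unlock();
        }
    }

    int completedCount() {
        lock.lock();
        try {
            return completed.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("c1", "c2", "c3");

        SketchQueue perContainer = new SketchQueue();
        for (String id : ids) {
            perContainer.completedContainer(id);
        }

        SketchQueue bulk = new SketchQueue();
        bulk.completedContainers(ids);

        System.out.println("per-container lock acquisitions: " + perContainer.lockAcquisitions());
        System.out.println("bulk lock acquisitions: " + bulk.lockAcquisitions());
    }
}
```

The trade-off is that the bulk method holds the lock for the whole traversal, so a very long list delays other callers for longer.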



> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-7086.001.patch
>
>






[jira] [Comment Edited] (YARN-7086) Release all containers aynchronously

2017-08-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140295#comment-16140295
 ] 

Arun Suresh edited comment on YARN-7086 at 8/24/17 4:51 PM:


Thanks for chiming in, folks.
And yes, I agree with [~jlowe] too. To move forward, and if everyone is fine 
with the approach, I will post a patch that does the following:
* Introduce a *RELEASE_CONTAINERS* scheduler event: will refactor the existing 
RELEASE_CONTAINER event to take multiple containers.
* Expose an async release method in the AbstractYarnScheduler that takes a 
list of containers, splits the list into sub-lists of at most some (configurable?) 
maximum number of containers released at a time, and sends an event for each sub-list.
* Route all calls to release containers from the scheduler to the new API. 
Currently, the problematic ones are during app attempt completion, node removal, 
and the scheduler's handling of the AM's explicit container releases.
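
The splitting step in the second bullet could look roughly like the sketch below: a list of container IDs is cut into sub-lists of at most a configured size, and one event per sub-list is placed on a queue for later, asynchronous handling. `ReleaseBatcher`, `ReleaseContainersEvent`, and the in-memory queue are hypothetical names chosen for illustration; they are not the actual scheduler classes or event dispatcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ReleaseBatcher {
    // Hypothetical event carrying one sub-list of containers to release.
    static final class ReleaseContainersEvent {
        final List<String> containerIds;

        ReleaseContainersEvent(List<String> containerIds) {
            // Copy so the event does not depend on the caller's list.
            this.containerIds = Collections.unmodifiableList(new ArrayList<>(containerIds));
        }
    }

    // Stand-in for the scheduler's event queue; a consumer thread would drain it.
    private final BlockingQueue<ReleaseContainersEvent> eventQueue = new LinkedBlockingQueue<>();
    private final int maxPerEvent;

    ReleaseBatcher(int maxPerEvent) {
        if (maxPerEvent <= 0) {
            throw new IllegalArgumentException("maxPerEvent must be positive");
        }
        this.maxPerEvent = maxPerEvent;
    }

    // Splits the list into sub-lists of at most maxPerEvent entries and
    // enqueues one event per sub-list. Returns the number of events enqueued.
    int asyncReleaseContainers(List<String> containerIds) {
        int events = 0;
        for (int start = 0; start < containerIds.size(); start += maxPerEvent) {
            int end = Math.min(start + maxPerEvent, containerIds.size());
            eventQueue.add(new ReleaseContainersEvent(containerIds.subList(start, end)));
            events++;
        }
        return events;
    }

    BlockingQueue<ReleaseContainersEvent> eventQueue() {
        return eventQueue;
    }

    public static void main(String[] args) {
        ReleaseBatcher batcher = new ReleaseBatcher(3);
        int events = batcher.asyncReleaseContainers(
                Arrays.asList("c1", "c2", "c3", "c4", "c5", "c6", "c7"));
        System.out.println("events enqueued: " + events);
    }
}
```

Capping the sub-list size bounds how long any single event holds the scheduler's attention, at the cost of more events for large releases.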



> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>


