[jira] [Commented] (YARN-7086) Release all containers aynchronously

2024-01-04 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802704#comment-17802704
 ] 

Shilun Fan commented on YARN-7086:
--

Bulk update: moved all 3.4.0 non-blocker issues to 3.5.0. Please move this 
issue back if it is a blocker.

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch, 
> YARN-7086.Perf-test-case.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> cause scheduling of new containers to come to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally 
> (cc [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases temp containers which it creates for the update)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7086) Release all containers aynchronously

2019-01-09 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739035#comment-16739035
 ] 

Manikandan R commented on YARN-7086:


[~asuresh] [~leftnoteasy] [~sunilg] Can you please share your views?




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-12-03 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707575#comment-16707575
 ] 

Manikandan R commented on YARN-7086:


Thanks [~jlowe]. [~asuresh] [~leftnoteasy] [~sunilg] Can you please share your 
thoughts?




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-11-28 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702010#comment-16702010
 ] 

Jason Lowe commented on YARN-7086:
--

Sorry for the long delay.  It's good to see the performance number variance 
mostly eliminated.

I'm still not convinced this is something we want to do.  The performance 
numbers show that async release is almost 3x more expensive in terms of release 
latency than what we have today.  I think we need a clear use case showing that 
the increased latency is buying us something worth that increased cost, both in 
terms of latency and code complexity.  "Given we want to release containers 
async" was based on the old code where there was a very expensive lock being 
acquired for each container release, but that does not appear to be the case in 
recent builds.  Now that the expensive lock is out of this critical path, I'm 
not sure we want or need this added complexity.

Are others seeing issues with bulk container releases in recent builds?  Is 
there still a general demand for this feature?





[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-10-16 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065
 ] 

Manikandan R commented on YARN-7086:


 
[~jlowe] Reduced I/O by removing unnecessary stdout printing and lowering the 
log level. With these changes, I ran the test cases again, and the measurements 
(in ms) no longer differ drastically between runs of each case. In addition to 
the three original cases, and since the original intent of this Jira is to 
release containers asynchronously, I added a 4th case that releases each 
container asynchronously, one at a time, to understand the difference between 
multiple container list traversal and handling each container separately. 
Based on the results below, the 2nd case (multiple container list traversal) 
not only reduces performance but also increases code complexity. With the 4th 
case, the code changes are simple and clean. Though the 4th case takes longer 
than the 1st and 3rd cases, can we pick the 4th case given that we want to 
release containers async? Thoughts? 

 
||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async release for each container separately)||
|1|496|1430|444|1067|
|2|490|1604|453|1401|
|3|427|1133|438|972|
|4|482|1342|429|1228|
|5|459|1106|412|1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|
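
To compare the cases at a glance, the averages can be expressed as multiples of the existing code's average. The following is a minimal sketch that recomputes the means from the table above (values in ms); the class, constant, and method names are illustrative only and are not part of any attached patch.

```java
// Illustrative only: recomputes the averages reported in the table (ms).
final class ReleaseTimings {
  static final double[] EXISTING = {496, 490, 427, 482, 459};
  static final double[] ASYNC_BATCHED = {1430, 1604, 1133, 1342, 1106};
  static final double[] SYNC_BATCHED = {444, 453, 438, 429, 412};
  static final double[] ASYNC_PER_CONTAINER = {1067, 1401, 972, 1228, 1176};

  private ReleaseTimings() {}

  /** Arithmetic mean of the samples. */
  static double mean(double[] samples) {
    double sum = 0;
    for (double s : samples) {
      sum += s;
    }
    return sum / samples.length;
  }

  /** Mean of the candidate runs divided by the mean of the baseline runs. */
  static double ratioToBaseline(double[] candidate, double[] baseline) {
    return mean(candidate) / mean(baseline);
  }

  public static void main(String[] args) {
    System.out.printf("async, batched:        %.2fx%n",
        ratioToBaseline(ASYNC_BATCHED, EXISTING));
    System.out.printf("not async, batched:    %.2fx%n",
        ratioToBaseline(SYNC_BATCHED, EXISTING));
    System.out.printf("async, per container:  %.2fx%n",
        ratioToBaseline(ASYNC_PER_CONTAINER, EXISTING));
  }
}
```

On these numbers the two async variants take roughly 2.8 and 2.5 times as long as the existing code on average, while the non-async batched variant is slightly faster than the existing code.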

 




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-10-10 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645159#comment-16645159
 ] 

Jason Lowe commented on YARN-7086:
--

Thanks for developing a perf test case!  The huge variations in runtime need to 
be investigated.  The second test case variations are up to 63%, including 
multiple samples that are slower than existing code average.  With this data, I 
would argue the results are close to the noise range given the wild swings in 
measurements.  How could it sometimes be well over 50% faster sometimes?  Is 
the JVM hitting a large GC?  System I/O?  I see the test is spamming logs on 
stdout in a tight loop while measuring timing -- that's not good.  I could see 
I/O effects dominating the runtimes.  Try running this where the test produces 
as little output as possible while running.  No stdout printing in the tight 
loop, use a log4j.properties that suppresses the RM logging, etc.  We need to 
get the runs to be a lot more consistent, otherwise we're probably not 
measuring what we think we're measuring.
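
A quiet timing loop of the kind described above might look like the sketch below. It is not the attached perf test case: `QuietTimer` and its parameters are hypothetical, and the untimed warm-up runs are an added assumption (a common way to reduce JIT-related noise), not something requested in this comment. The point it illustrates is that nothing is printed or logged inside the timed region; results are reported only after all runs finish.

```java
// Hypothetical helper: times a task without producing any output while timing.
final class QuietTimer {
  private QuietTimer() {}

  /**
   * Runs the task {@code warmups} times untimed, then {@code runs} times timed.
   * Returns the elapsed wall-clock time of each timed run in milliseconds.
   */
  static double[] time(Runnable task, int warmups, int runs) {
    for (int i = 0; i < warmups; i++) {
      task.run(); // assumption: warm-up runs are discarded
    }
    double[] millis = new double[runs];
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      task.run(); // no printing or logging inside the timed region
      millis[i] = (System.nanoTime() - start) / 1_000_000.0;
    }
    return millis;
  }

  public static void main(String[] args) {
    double[] samples = time(() -> {
      // Placeholder workload standing in for the container release call.
      long acc = 0;
      for (int i = 0; i < 1_000_000; i++) {
        acc += i;
      }
      if (acc < 0) {
        throw new IllegalStateException("unreachable");
      }
    }, 3, 5);
    // Report only after every measurement has been taken.
    for (double s : samples) {
      System.out.printf("%.3f ms%n", s);
    }
  }
}
```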





[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-09-20 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623100#comment-16623100
 ] 

Manikandan R commented on YARN-7086:


[~jlowe] I ran a simple performance test to understand container release 
behaviour. The test releases 10K containers in a single AM allocate call and 
measures the time taken (in secs) to release all containers with the three 
flows below:

1. Existing code: No changes.

2. With Patch (Async release + multiple container list traversal): Used 
.002.patch as is with a batch size of 1K.

3. With Patch (Not Async release + multiple container list traversal): Slightly 
modified .002.patch to call the new completeContainers(Map containersToBeReleased, RMContainerEventType event) directly 
rather than going through the event flow.

 
||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||
|1|6.8|4.6|8.6|
|2|8.3|7.5|9.9|
|3|6.8|7.2|8.2|
|4|7.2|7.1|8.9|
|5|7.2|4.6|10|
|Average of 5 runs|7.26|6.2|9.12|

 

Attaching a patch containing only the test case to illustrate the above flows. 
Can you please validate the approach?
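
One way to quantify the run-to-run variation in the table above is the relative spread of each column, (max - min) / min. The sketch below is illustrative only: the class and method names are made up, and this particular measure is an assumption rather than a method stated anywhere in this thread. For the async column it gives about 63%.

```java
// Illustrative only: relative spread of the timings in the table (secs).
final class RunSpread {
  static final double[] EXISTING = {6.8, 8.3, 6.8, 7.2, 7.2};
  static final double[] ASYNC_BATCHED = {4.6, 7.5, 7.2, 7.1, 4.6};
  static final double[] SYNC_BATCHED = {8.6, 9.9, 8.2, 8.9, 10};

  private RunSpread() {}

  /** Returns (max - min) / min for the given samples. */
  static double relativeSpread(double[] samples) {
    double min = samples[0];
    double max = samples[0];
    for (double s : samples) {
      min = Math.min(min, s);
      max = Math.max(max, s);
    }
    return (max - min) / min;
  }

  public static void main(String[] args) {
    System.out.printf("existing:      %.0f%%%n", 100 * relativeSpread(EXISTING));
    System.out.printf("async batched: %.0f%%%n", 100 * relativeSpread(ASYNC_BATCHED));
    System.out.printf("sync batched:  %.0f%%%n", 100 * relativeSpread(SYNC_BATCHED));
  }
}
```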




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-09-15 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616408#comment-16616408
 ] 

Manikandan R commented on YARN-7086:


[~jlowe] Thanks for the very detailed suggestions.
{quote}I'm worried that we're delving into the classic pitfall of optimizing 
without profiling data or hands-on experience to prove the optimizations make 
sense.{quote}
Sorry about this. My understanding from the earlier discussion was that the 
LeafQueue lock would certainly degrade performance and that acquiring the lock 
only once was mandatory when releasing a batch of containers. Hence I went 
with this trade-off (multiple container list traversal). Now that we want to 
decide the next step based on a stress test (which is good for decision 
making), I will take a look at TestCapacitySchedulerPerf and run the tests. As 
you suggested, the numbers can help us define the next steps clearly.




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-09-11 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610823#comment-16610823
 ] 

Jason Lowe commented on YARN-7086:
--

I'm worried that we're delving into the classic pitfall of optimizing without 
profiling data or hands-on experience to prove the optimizations make sense.  
As I mentioned above, the big bad lock that slowed container release down in 
the past is now gone, so I don't know if container release is really a big 
problem in trunk anymore.  I think we need some hard data on where the 
bottlenecks are in the updated trunk code with respect to container release and 
data showing this new setup is worth it, especially since we're making 
tradeoffs of multiple container list traversal vs. obtaining the LeafQueue 
lock.  The profile tests should test scenarios where a single container is 
being released and also scenarios where thousands of containers are being 
released in a single AM heartbeat.

We could try running SLS or develop some targeted unit tests to stress this 
code path.  See TestCapacitySchedulerPerf for an example of a unit test that is 
built to stress a particular aspect of the scheduler for performance testing.




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-08-26 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592867#comment-16592867
 ] 

Manikandan R commented on YARN-7086:


[~jlowe] Thanks for sharing the background details.

Attached the .002 patch, which contains the changes required to acquire the 
LeafQueue lock only once when releasing a set of containers. It introduces 
wrapper methods on top of the existing methods to reuse functionality wherever 
possible. On the flip side, it ends up traversing the same set of containers a 
few more times during processing. Please review and share your comments. If 
the approach is fine, I can drill down further to look for more improvements.




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-08-23 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590556#comment-16590556
 ] 

Jason Lowe commented on YARN-7086:
--

bq. I assume you are referring the lock inside LeafQueue#completedContainer().

I was referring to the scheduler back in the 2.7/2.8 code which has changed 
considerably in trunk from that.  Back in 2.7 releasing a container required 
the highly-contended CapacityScheduler lock to be obtained, separately, for 
every container released.  When releasing a lot of containers in a single AM 
heartbeat, this caused a long backup as the highly-contended lock needed to be 
reacquired for every released container.  It would have been far more efficient 
to just grab the lock once and release all the containers with the lock held 
the entire time.

The big CapacityScheduler lock appears to be gone in trunk, so I would expect 
the next level of locking bottleneck to be the LeafQueue lock.
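
The difference between reacquiring a lock per container and holding it once for a whole batch can be sketched as follows. This is a toy model, not YARN code: `LockedQueue` and its methods are hypothetical stand-ins, and a plain `ReentrantLock` is assumed in place of the real scheduler or queue lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a queue guarded by a single contended lock.
final class LockedQueue {
  private final ReentrantLock lock = new ReentrantLock();
  private final List<String> released = new ArrayList<>();
  private int lockAcquisitions = 0;

  /** Per-container path: the lock is taken once for every released container. */
  void releaseOne(String containerId) {
    lock.lock();
    try {
      lockAcquisitions++;
      released.add(containerId);
    } finally {
      lock.unlock();
    }
  }

  /** Bulk path: the lock is taken once and held while all containers are released. */
  void releaseAll(List<String> containerIds) {
    lock.lock();
    try {
      lockAcquisitions++;
      released.addAll(containerIds);
    } finally {
      lock.unlock();
    }
  }

  /** Number of times a release path acquired the lock (reads are not counted). */
  int lockAcquisitions() {
    lock.lock();
    try {
      return lockAcquisitions;
    } finally {
      lock.unlock();
    }
  }

  int releasedCount() {
    lock.lock();
    try {
      return released.size();
    } finally {
      lock.unlock();
    }
  }
}
```

Releasing N containers through `releaseOne` costs N acquisitions of a lock that other threads are also competing for, while `releaseAll` costs one, at the price of holding the lock longer per acquisition.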




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-08-23 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590507#comment-16590507
 ] 

Manikandan R commented on YARN-7086:


Attached .001 patch for early review. It has changes as described in 
https://issues.apache.org/jira/browse/YARN-7086?focusedCommentId=16140295=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16140295.

[~jlowe]

{quote}I think it would be a lot better if there was a bulk-release interface 
so we could grab the critical lock once.{quote}

I assume you are referring to the lock inside LeafQueue#completedContainer(). 
If so, one approach would be to change Scheduler#completedContainer(), 
Scheduler#completedContainerInternal() and LeafQueue#completedContainer() to 
accept a list of containers and process it, rather than a single container. 
Currently, all these methods accept a single RMContainer and operate on it. 
With the new approach, we will need to work out how to accept a list and 
traverse it. Can you please confirm this?




[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-08-16 Thread Arun Suresh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16582803#comment-16582803
 ] 

Arun Suresh commented on YARN-7086:
---

Assigning to [~maniraj...@gmail.com], since he's kindly agreed to take this up.

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major






[jira] [Commented] (YARN-7086) Release all containers aynchronously

2017-08-24 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140295#comment-16140295
 ] 

Arun Suresh commented on YARN-7086:
---

Thanks for chiming in, folks.
And yes, I agree with [~jlowe] too. To move forward, and if everyone is fine 
with the approach, I will post a patch that does the following:
* Introduce a *RELEASE_CONTAINERS* scheduler event: will refactor the existing 
RELEASE_CONTAINER event to take multiple containers.
* Expose an async release method in the AbstractYarnScheduler that takes a 
list of containers, splits the list into sub-lists of some (configured?) max 
number of containers released at a time, and sends an event for each sub-list.
* Route all calls to release containers from both schedulers to the new API. 
Currently, the problematic ones are during app attempt complete, node removed, 
and the schedulers' handling of the AM's explicit release of containers.
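
The batching step in the list above could look roughly like the following. This is only a sketch in plain Java, not the patch itself: BatchedReleaseSketch, ReleaseContainersEvent, maxPerEvent and the Consumer-based dispatcher are all hypothetical names, and container IDs are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: none of these names are taken from the Hadoop code base. */
final class BatchedReleaseSketch {

  /** Hypothetical stand-in for a RELEASE_CONTAINERS scheduler event. */
  static final class ReleaseContainersEvent {
    final List<String> containerIds;

    ReleaseContainersEvent(List<String> containerIds) {
      this.containerIds = Collections.unmodifiableList(new ArrayList<>(containerIds));
    }
  }

  /** Splits ids into consecutive sub-lists of at most maxPerEvent entries. */
  static List<List<String>> split(List<String> ids, int maxPerEvent) {
    if (maxPerEvent <= 0) {
      throw new IllegalArgumentException("maxPerEvent must be positive");
    }
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < ids.size(); start += maxPerEvent) {
      int end = Math.min(start + maxPerEvent, ids.size());
      batches.add(new ArrayList<>(ids.subList(start, end)));
    }
    return batches;
  }

  /** Hands one event per sub-list to the dispatcher and returns the event count. */
  static int asyncRelease(List<String> ids, int maxPerEvent,
      Consumer<ReleaseContainersEvent> dispatcher) {
    List<List<String>> batches = split(ids, maxPerEvent);
    for (List<String> batch : batches) {
      dispatcher.accept(new ReleaseContainersEvent(batch));
    }
    return batches.size();
  }

  public static void main(String[] args) {
    List<String> ids = Arrays.asList("c1", "c2", "c3", "c4", "c5");
    // The dispatcher here just prints; a real one would enqueue the event.
    int sent = asyncRelease(ids, 2, e -> System.out.println(e.containerIds));
    System.out.println("events sent: " + sent);
  }
}
```

The caller returns as soon as the events are handed over; whoever consumes them does the actual release, one sub-list at a time.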

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh






[jira] [Commented] (YARN-7086) Release all containers aynchronously

2017-08-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139241#comment-16139241
 ] 

Wangda Tan commented on YARN-7086:
--

The only potential issue I can see is this: prior to this change, the AM can 
assume containers are released by the RM once allocate() returns. In the new 
world, the AM has to check the completed container list in AllocateResponse to 
make sure the containers are released. It may not be a big issue, though, since 
I don't think we guarantee this in the API description.

Beyond that, I like Jason's idea as well. To share one fact: when I was doing 
the async scheduling test in YARN-5139, I found that the resource commit phase 
(acquire the write lock, check and update scheduler internal state such as 
resource usage, etc.) takes less than 6% of the time; most of the time is 
consumed by {{CapacityScheduler#allocateContainersToNode}}. I suspect container 
release takes a similar amount of time (around 6%).
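
To illustrate the AM-side bookkeeping this would imply, here is a minimal sketch in plain Java. PendingReleaseTracker and its methods are hypothetical, and container IDs are modelled as strings; in a real AM the completed IDs would presumably be read from the statuses returned by AllocateResponse#getCompletedContainersStatuses() on each heartbeat.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Sketch only: tracks releases the AM has requested but not yet seen confirmed. */
final class PendingReleaseTracker {
  private final Set<String> pending = new HashSet<>();

  /** Called when the AM asks the RM to release a container. */
  void requestRelease(String containerId) {
    pending.add(containerId);
  }

  /** Called with the completed container IDs taken from one allocate response. */
  void onCompleted(Collection<String> completedIds) {
    pending.removeAll(completedIds);
  }

  /** True while a requested release has not yet shown up as completed. */
  boolean isAwaitingConfirmation(String containerId) {
    return pending.contains(containerId);
  }

  int pendingCount() {
    return pending.size();
  }

  public static void main(String[] args) {
    PendingReleaseTracker tracker = new PendingReleaseTracker();
    tracker.requestRelease("c1");
    tracker.requestRelease("c2");
    // Simulated heartbeat: only c1 is reported as completed so far.
    tracker.onCompleted(Arrays.asList("c1"));
    System.out.println("still pending: " + tracker.pendingCount());
  }
}
```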

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh






[jira] [Commented] (YARN-7086) Release all containers aynchronously

2017-08-23 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138947#comment-16138947
 ] 

Karthik Kambatla commented on YARN-7086:


I like Jason's idea. 

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh






[jira] [Commented] (YARN-7086) Release all containers aynchronously

2017-08-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138915#comment-16138915
 ] 

Jason Lowe commented on YARN-7086:
--

We've noticed container release is particularly painful as well, although we 
haven't seen it deadlock.

Whether we do this asynchronously or not, one issue is that releasing a bunch 
of containers requires grabbing a highly-contended lock for every container 
released.  Do this in a loop and it ends up taking a long time since getting 
the lock is not cheap.  Async scheduling helps since we can wait in some other 
thread rather than in the AM handler threads or scheduler dispatcher thread, 
but it will still take a long time looping through all those events.  I think 
it would be a lot better if there was a bulk-release interface so we could grab 
the critical lock once.  We can put a limit on how many we do per batch if 
we're worried it will hold that lock for too long, but I don't think it's so 
much the actual work per container as it is the time spent waiting for the lock 
that makes this so painful.
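
A rough sketch of the difference, in plain Java rather than the actual scheduler code: BulkReleaseLockSketch, its single ReentrantLock and the maxPerBatch parameter are all assumptions made for illustration. It only counts how often the lock is taken; it does not model contention or the real per-container work.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch only: compares per-container locking with one lock per batch. */
final class BulkReleaseLockSketch {
  private final ReentrantLock schedulerLock = new ReentrantLock();
  private final Set<String> liveContainers = new HashSet<>();
  private final AtomicInteger lockAcquisitions = new AtomicInteger();

  BulkReleaseLockSketch(List<String> initial) {
    liveContainers.addAll(initial);
  }

  /** Takes the lock once for every container released. */
  void releaseOneByOne(List<String> ids) {
    for (String id : ids) {
      schedulerLock.lock();
      try {
        lockAcquisitions.incrementAndGet();
        liveContainers.remove(id);
      } finally {
        schedulerLock.unlock();
      }
    }
  }

  /** Takes the lock once per batch of at most maxPerBatch containers. */
  void releaseInBatches(List<String> ids, int maxPerBatch) {
    if (maxPerBatch <= 0) {
      throw new IllegalArgumentException("maxPerBatch must be positive");
    }
    for (int start = 0; start < ids.size(); start += maxPerBatch) {
      int end = Math.min(start + maxPerBatch, ids.size());
      schedulerLock.lock();
      try {
        lockAcquisitions.incrementAndGet();
        for (String id : ids.subList(start, end)) {
          liveContainers.remove(id);
        }
      } finally {
        schedulerLock.unlock();
      }
    }
  }

  int lockAcquisitions() {
    return lockAcquisitions.get();
  }

  int liveCount() {
    schedulerLock.lock();
    try {
      return liveContainers.size();
    } finally {
      schedulerLock.unlock();
    }
  }

  public static void main(String[] args) {
    List<String> ids = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      ids.add("c" + i);
    }
    BulkReleaseLockSketch batched = new BulkReleaseLockSketch(ids);
    batched.releaseInBatches(ids, 4);
    System.out.println("lock acquisitions: " + batched.lockAcquisitions());
  }
}
```

The batch limit bounds how long the lock is held in one go, which is the trade-off mentioned above.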


> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Arun Suresh


