Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 4, 2014, 11:41 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java, line 175 https://reviews.apache.org/r/24915/diff/1/?file=665740#file665740line175 Shouldn't this be == EXECUTOR_GC_INTERVAL to minimize the side-effects of a failover? As-is, you'll have a 15 minute period with 4x elevated GC runs. If so, this arg could collapse, and the new GcExecutorSettings method can go away. I was split originally on reusing the same interval but decided in favor of a shorter interval to catch up a bit faster. Fine either way. On Sept. 4, 2014, 11:41 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java, line 145 https://reviews.apache.org/r/24915/diff/1/?file=665741#file665741line145 Please wrap this with Collections.synchronizedMap to preserve thread safety. Done. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52376 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Sept. 6, 2014, 1:18 a.m.) Review request for Aurora and Bill Farner. Changes --- CR comments. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs (updated) - src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote: I'm concerned that the problem we're solving is underspecified. An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this). I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it. Maxim Khutornenko wrote: I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache. I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers Can you offer more detail here - why is this bad behavior? - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52037 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote: I'm concerned that the problem we're solving is underspecified. An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this). I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it. Maxim Khutornenko wrote: I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache. Bill Farner wrote: I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers Can you offer more detail here - why is this bad behavior? We spread GC activity randomly to avoid hourly spikes (https://reviews.apache.org/r/16247/). Failovers, however, break that pattern and manifest in unusual GC activity spikes. All Mesos matters aside, I don't think scheduler interruptions should result in what appears externally as a catch-up behavior. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52037 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52328 --- Per offline discussion, i've realized my memory leak concern is ~moot since we already have the unbounded HostAttributes store. I liked your suggestion of augmenting the way this soft state is used: - use a Map, no expiry - when receiving an offer, don't kick off a GC task if a map entry is not present, store a future timestamp after which we should launch a GC task - Bill Farner On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 4, 2014, 5:44 p.m., Bill Farner wrote: Per offline discussion, i've realized my memory leak concern is ~moot since we already have the unbounded HostAttributes store. I liked your suggestion of augmenting the way this soft state is used: - use a Map, no expiry - when receiving an offer, don't kick off a GC task if a map entry is not present, store a future timestamp after which we should launch a GC task Thanks. Just to make sure we are on the same page, the bullets above describe exactly how it's implemented by the current RB. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52328 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 4, 2014, 5:44 p.m., Bill Farner wrote: Per offline discussion, i've realized my memory leak concern is ~moot since we already have the unbounded HostAttributes store. I liked your suggestion of augmenting the way this soft state is used: - use a Map, no expiry - when receiving an offer, don't kick off a GC task if a map entry is not present, store a future timestamp after which we should launch a GC task Maxim Khutornenko wrote: Thanks. Just to make sure we are on the same page, the bullets above describe exactly how it's implemented by the current RB. Hah, i was indeed on a different page - taking another look. - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52328 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52376 --- Ship it! LGTM overall. src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java https://reviews.apache.org/r/24915/#comment91149 Shouldn't this be == EXECUTOR_GC_INTERVAL to minimize the side-effects of a failover? As-is, you'll have a 15 minute period with 4x elevated GC runs. If so, this arg could collapse, and the new GcExecutorSettings method can go away. src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java https://reviews.apache.org/r/24915/#comment91148 Please wrap this with Collections.synchronizedMap to preserve thread safety. - Bill Farner On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
On Sept. 2, 2014, 4:43 p.m., Bill Farner wrote: I'm concerned that the problem we're solving is underspecified. An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this). I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it. I don't think we are trying to fix a Mesos problem here. Regardless of the underlying Mesos resolution (MESOS-1646), I do think Aurora should be a good Mesos citizen here and avoid abnormal offer activity spikes driven by failovers. In order for that memory leak to expose itself in any reasonable manner, there must be an intensive and constant churn of hosts in the cluster. I don't think it's a likely scenario but if you are concerned it should be easy to converge on a hybrid solution with expiring cache where the EXECUTOR_GC_RESTART_INTERVAL kicks in any time an item is not found in the cache. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52037 --- On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Re: Review Request 24915: Adding initial GC task delay on scheduler restart.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/#review52037 --- I'm concerned that the problem we're solving is underspecified. An immediate issue i have with this patch is that it introduces a memory leak (in practice, this is hidden with default settings due to failover failover hides this). I'll reply on the ticket to try to reach consensus on what the problem is and what it means to fix it. - Bill Farner On Aug. 20, 2014, 11:35 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- (Updated Aug. 20, 2014, 11:35 p.m.) Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko
Review Request 24915: Adding initial GC task delay on scheduler restart.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24915/ --- Review request for Aurora and Bill Farner. Bugs: AURORA-608 https://issues.apache.org/jira/browse/AURORA-608 Repository: aurora Description --- Adding initial GC task delay on scheduler restart. Diffs - src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java 5a38a1f69ac5dbe68af3bfe175899ddee392880b src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java 65f404915bc60ffe11a7a57d9861ac5b37fa646a src/test/java/org/apache/aurora/scheduler/async/GcExecutorLauncherTest.java 04137072891b2a1f0ad663182629dd469b09324f Diff: https://reviews.apache.org/r/24915/diff/ Testing --- gradle -Pq build Thanks, Maxim Khutornenko