Re: Review Request 51662: AURORA-1602 Aurora admin commands for reconcilation

2016-09-09 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51662/#review148374
---


Ship it!




LGTM


src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
 (lines 143 - 146)


Assert after each test - 

reconciler.triggerExplicitReconciliation(Optional.of(BATCH_SIZE));
assertEquals(6L, explicitRuns.get());
assertEquals(2L, implicitRuns.get());

reconciler.triggerImplicitReconciliation();
assertEquals(6L, explicitRuns.get());
assertEquals(3L, implicitRuns.get());


- Santhosh Kumar Shanmugham


On Sept. 9, 2016, 11:43 a.m., Karthik Anantha Padmanabhan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51662/
> ---
> 
> (Updated Sept. 9, 2016, 11:43 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> AURORA-1602 Aurora admin commands for reconcilation
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md bbf7198dc7ce53bb02a1bc59f75a661e23472913 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   
> src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java 
> 3275d72a0a74909e635bce615e2036ec72aa38ee 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  929d021a336c6a3438613c9340c84a1096dc9069 
>   src/main/python/apache/aurora/admin/admin.py 
> 76009b9c1c7a130c25abad544a176dc590dafb12 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 9149c3018ae58d405f284fcbd4076d251ccc8192 
>   
> src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
>  b9317dc20456f90057ec2bf4d10619a5ae986187 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  779dc302602ae8842084807ca868a491ea99b676 
>   src/test/python/apache/aurora/admin/test_admin.py 
> eb193c4a33f5275f05a995338446bec0e19bc1cf 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afac2500551af2fce406bb906aa4e33f353e90a1 
> 
> Diff: https://reviews.apache.org/r/51662/diff/
> 
> 
> Testing
> ---
> 
> * Manually tested on my local vagrant installation
> * ./build-support/jenkins/build.sh
> 
> 
> Thanks,
> 
> Karthik Anantha Padmanabhan
> 
>



Re: Review Request 51765: Batching writes - Part 3 (of 3): Converting TaskScheduler to use BatchWorker.

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51765/#review148362
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 6:52 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51765/
> ---
> 
> (Updated Sept. 9, 2016, 6:52 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the final part of the `BatchWorker` conversion work that converts 
> `TaskScheduler`. See https://reviews.apache.org/r/51759 for more background 
> on the `BatchWorker`.
> 
> #Problem
> See https://reviews.apache.org/r/51759
> 
> #Remediation
> Task scheduling is one of the most dominant users of the write lock. It's 
> also one of the heaviest and the most latency-sensitive. As such, the default 
> max batch size is chosen conservatively low (3) and batch items are executed 
> in a blocking way. 
> 
> BTW, attempting to make task scheduling non-blocking resulted in a much worse 
> scheduling performance. The way our `DBTaskStore` is wired, all async 
> activities, including `EventBus` are bound to use a single async `Executor`, 
> which is currently limited at 8 threads [1]. Relying on the same `EventBus` 
> to deliver scheduling completion events resulted in slower scheduling perf as 
> those events were backed up behind all other activities, including tasks 
> status events, reconciliation and etc. Increasing the executor thread pool 
> size to a larger number on the other side, also increased the lock contention 
> defeating the whole purpose of this work.
> 
> #Results
> See https://reviews.apache.org/r/51759 for the lock contention results.
> 
> https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java#L51-L54
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> 9d0d40b82653fb923bed16d06546288a1576c21d 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 11e8033438ad0808e446e41bb26b3fa4c04136c7 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> c044ebe6f72183a67462bbd8e5be983eb592c3e9 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> d266f6a25ae2360db2977c43768a19b1f1efe8ff 
>   src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java 
> c2ceb4e7685a9301f8014a9183e02fbad65bca26 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 
> 95cf25eda0a5bfc0cc4c46d1439ebe9d5359ce79 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  72562e6bd9a9860c834e6a9faa094c28600a8fed 
> 
> Diff: https://reviews.apache.org/r/51765/diff/
> 
> 
> Testing
> ---
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 51662: AURORA-1602 Aurora admin commands for reconcilation

2016-09-09 Thread Maxim Khutornenko


> On Sept. 9, 2016, 12:26 a.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java,
> >  line 125
> > 
> >
> > This smells to me and seems awkward. I guess this stemms from the fact 
> > that the `I*` wrappers don't have Optional fields for integers. Instead of 
> > this here, why not have the signature be 
> > `triggerExplicitReconciliation(Optional batchSize)`?
> > 
> > You can then do `int trueBatchSize = 
> > batchSize.orElse(settings.explicitBatchSize)`.
> > 
> > You can then push validation of the value to the RPC/API layer.

+1. Validating at RPC layer and sending down an `Optional` seems a 
better solution here.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51662/#review148292
---


On Sept. 9, 2016, 6:43 p.m., Karthik Anantha Padmanabhan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51662/
> ---
> 
> (Updated Sept. 9, 2016, 6:43 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> AURORA-1602 Aurora admin commands for reconcilation
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md bbf7198dc7ce53bb02a1bc59f75a661e23472913 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   
> src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java 
> 3275d72a0a74909e635bce615e2036ec72aa38ee 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  929d021a336c6a3438613c9340c84a1096dc9069 
>   src/main/python/apache/aurora/admin/admin.py 
> 76009b9c1c7a130c25abad544a176dc590dafb12 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 9149c3018ae58d405f284fcbd4076d251ccc8192 
>   
> src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
>  b9317dc20456f90057ec2bf4d10619a5ae986187 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  779dc302602ae8842084807ca868a491ea99b676 
>   src/test/python/apache/aurora/admin/test_admin.py 
> eb193c4a33f5275f05a995338446bec0e19bc1cf 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afac2500551af2fce406bb906aa4e33f353e90a1 
> 
> Diff: https://reviews.apache.org/r/51662/diff/
> 
> 
> Testing
> ---
> 
> * Manually tested on my local vagrant installation
> * ./build-support/jenkins/build.sh
> 
> 
> Thanks,
> 
> Karthik Anantha Padmanabhan
> 
>



Re: Review Request 51662: AURORA-1602 Aurora admin commands for reconcilation

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51662/#review148359
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 6:43 p.m., Karthik Anantha Padmanabhan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51662/
> ---
> 
> (Updated Sept. 9, 2016, 6:43 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> AURORA-1602 Aurora admin commands for reconcilation
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md bbf7198dc7ce53bb02a1bc59f75a661e23472913 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   
> src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java 
> 3275d72a0a74909e635bce615e2036ec72aa38ee 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  929d021a336c6a3438613c9340c84a1096dc9069 
>   src/main/python/apache/aurora/admin/admin.py 
> 76009b9c1c7a130c25abad544a176dc590dafb12 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 9149c3018ae58d405f284fcbd4076d251ccc8192 
>   
> src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
>  b9317dc20456f90057ec2bf4d10619a5ae986187 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  779dc302602ae8842084807ca868a491ea99b676 
>   src/test/python/apache/aurora/admin/test_admin.py 
> eb193c4a33f5275f05a995338446bec0e19bc1cf 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afac2500551af2fce406bb906aa4e33f353e90a1 
> 
> Diff: https://reviews.apache.org/r/51662/diff/
> 
> 
> Testing
> ---
> 
> * Manually tested on my local vagrant installation
> * ./build-support/jenkins/build.sh
> 
> 
> Thanks,
> 
> Karthik Anantha Padmanabhan
> 
>



Review Request 51765: Batching writes - Part 3 (of 3): Converting TaskScheduler to use BatchWorker.

2016-09-09 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51765/
---

Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

This is the final part of the `BatchWorker` conversion work that converts 
`TaskScheduler`. See https://reviews.apache.org/r/51759 for more background on 
the `BatchWorker`.

#Problem
See https://reviews.apache.org/r/51759

#Remediation
Task scheduling is one of the most dominant users of the write lock. It's also 
one of the heaviest and the most latency-sensitive. As such, the default max 
batch size is chosen conservatively low (3) and batch items are executed in a 
blocking way. 

BTW, attempting to make task scheduling non-blocking resulted in a much worse 
scheduling performance. The way our `DBTaskStore` is wired, all async 
activities, including `EventBus` are bound to use a single async `Executor`, 
which is currently limited at 8 threads [1]. Relying on the same `EventBus` to 
deliver scheduling completion events resulted in slower scheduling perf as 
those events were backed up behind all other activities, including tasks status 
events, reconciliation and etc. Increasing the executor thread pool size to a 
larger number on the other side, also increased the lock contention defeating 
the whole purpose of this work.

#Results
See https://reviews.apache.org/r/51759 for the lock contention results.

https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/async/AsyncModule.java#L51-L54


Diffs
-

  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
9d0d40b82653fb923bed16d06546288a1576c21d 
  src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
11e8033438ad0808e446e41bb26b3fa4c04136c7 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
c044ebe6f72183a67462bbd8e5be983eb592c3e9 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
d266f6a25ae2360db2977c43768a19b1f1efe8ff 
  src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java 
c2ceb4e7685a9301f8014a9183e02fbad65bca26 
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 
95cf25eda0a5bfc0cc4c46d1439ebe9d5359ce79 
  
src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java 
72562e6bd9a9860c834e6a9faa094c28600a8fed 

Diff: https://reviews.apache.org/r/51765/diff/


Testing
---

All types of testing including deploying to test and production clusters.


Thanks,

Maxim Khutornenko



Re: Review Request 51662: AURORA-1602 Aurora admin commands for reconcilation

2016-09-09 Thread Karthik Anantha Padmanabhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51662/
---

(Updated Sept. 9, 2016, 6:43 p.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Repository: aurora


Description
---

AURORA-1602 Aurora admin commands for reconcilation


Diffs (updated)
-

  RELEASE-NOTES.md bbf7198dc7ce53bb02a1bc59f75a661e23472913 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 
c5765b70501c101f0535b4eed94e9948c36808f9 
  src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java 
3275d72a0a74909e635bce615e2036ec72aa38ee 
  
src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 
929d021a336c6a3438613c9340c84a1096dc9069 
  src/main/python/apache/aurora/admin/admin.py 
76009b9c1c7a130c25abad544a176dc590dafb12 
  src/main/python/apache/aurora/client/api/__init__.py 
9149c3018ae58d405f284fcbd4076d251ccc8192 
  
src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
 b9317dc20456f90057ec2bf4d10619a5ae986187 
  
src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
 779dc302602ae8842084807ca868a491ea99b676 
  src/test/python/apache/aurora/admin/test_admin.py 
eb193c4a33f5275f05a995338446bec0e19bc1cf 
  src/test/python/apache/aurora/client/api/test_scheduler_client.py 
afac2500551af2fce406bb906aa4e33f353e90a1 

Diff: https://reviews.apache.org/r/51662/diff/


Testing
---

* Manually tested on my local vagrant installation
* ./build-support/jenkins/build.sh


Thanks,

Karthik Anantha Padmanabhan



Re: Review Request 51763: Batching writes - Part 2 (of 3): Converting cron jobs to use BatchWorker.

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51763/#review148356
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 6:13 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51763/
> ---
> 
> (Updated Sept. 9, 2016, 6:13 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the second part of the `BatchWorker` conversion work that moves cron 
> jobs to use non-blocking kill followups and reduces the number of trigger 
> threads. See https://reviews.apache.org/r/51759 for more background on the 
> `BatchWorker`.
> 
> #Problem
> The current implementation of the cron scheduling relies on a large number of 
> threads (`cron_scheduler_num_threads`=100) to support cron triggering and 
> killing existing tasks according to `KILL_EXISTING` collision policy. This 
> creates large spikes of activities at synchronized intervals as users tend to 
> schedule their cron runs around similar schedules. Moreover, the current 
> implementation re-acquires write locks multiple times to deliver on 
> `KILL_EXISTING` policy. 
> 
> #Remediation
> Trigger level batching is still done in a blocking way but multiple cron 
> triggers may be bundled together to share the same write transaction. Any 
> followups, however, are performed in a non-blocking way by relying on a 
> `BatchWorker.executeWithReplay()` and the `BatchWorkCompleted` notification. 
> In order to still ensure non-concurrent execution of a given job key trigger, 
> a token (job key) is saved within the trigger itself. A concurrent trigger 
> will bail if a kill followup is still in progress (token is set AND no entry 
> in `killFollowups` set exists yet).
> 
> #Results
> The above approach allowed reducing the number of cron threads to 10 and 
> likely can be reduced even further. See https://reviews.apache.org/r/51759 
> for the lock contention results.
> 
> 
> Diffs
> -
> 
>   commons/src/main/java/org/apache/aurora/common/util/BackoffHelper.java 
> 8e73dd9ebc43e06f696bbdac4d658e4b225e7df7 
>   commons/src/test/java/org/apache/aurora/common/util/BackoffHelperTest.java 
> bc30990d57f444f7d64805ed85c363f1302736d0 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> c07551e94f9221b5b21c5dc9715e82caa290c2e8 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/CronModule.java 
> 155d702d68367b247dd066f773c662407f0e3b5b 
>   
> src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 
> 5c64ff2994e200b3453603ac5470e8e152cebc55 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 1c0a3fa84874d7bc185b78f13d2664cb4d8dd72f 
> 
> Diff: https://reviews.apache.org/r/51763/diff/
> 
> 
> Testing
> ---
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 51662: AURORA-1602 Aurora admin commands for reconcilation

2016-09-09 Thread Karthik Anantha Padmanabhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51662/
---

(Updated Sept. 9, 2016, 6:32 p.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

* Address review comments from Zameer, Maxim.
* Add two new UTs.
* Validate the batch size input.


Repository: aurora


Description
---

AURORA-1602 Aurora admin commands for reconcilation


Diffs (updated)
-

  RELEASE-NOTES.md bbf7198dc7ce53bb02a1bc59f75a661e23472913 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 
c5765b70501c101f0535b4eed94e9948c36808f9 
  src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java 
3275d72a0a74909e635bce615e2036ec72aa38ee 
  
src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 
929d021a336c6a3438613c9340c84a1096dc9069 
  src/main/python/apache/aurora/admin/admin.py 
76009b9c1c7a130c25abad544a176dc590dafb12 
  src/main/python/apache/aurora/client/api/__init__.py 
9149c3018ae58d405f284fcbd4076d251ccc8192 
  
src/test/java/org/apache/aurora/scheduler/reconciliation/TaskReconcilerTest.java
 b9317dc20456f90057ec2bf4d10619a5ae986187 
  
src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
 779dc302602ae8842084807ca868a491ea99b676 
  src/test/python/apache/aurora/admin/test_admin.py 
eb193c4a33f5275f05a995338446bec0e19bc1cf 
  src/test/python/apache/aurora/client/api/test_scheduler_client.py 
afac2500551af2fce406bb906aa4e33f353e90a1 

Diff: https://reviews.apache.org/r/51662/diff/


Testing
---

* Manually tested on my local vagrant installation
* ./build-support/jenkins/build.sh


Thanks,

Karthik Anantha Padmanabhan



Review Request 51763: Batching writes - Part 2 (of 3): Converting cron jobs to use BatchWorker.

2016-09-09 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51763/
---

Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

This is the second part of the `BatchWorker` conversion work that moves cron 
jobs to use non-blocking kill followups and reduces the number of trigger 
threads. See https://reviews.apache.org/r/51759 for more background on the 
`BatchWorker`.

#Problem
The current implementation of the cron scheduling relies on a large number of 
threads (`cron_scheduler_num_threads`=100) to support cron triggering and 
killing existing tasks according to `KILL_EXISTING` collision policy. This 
creates large spikes of activities at synchronized intervals as users tend to 
schedule their cron runs around similar schedules. Moreover, the current 
implementation re-acquires write locks multiple times to deliver on 
`KILL_EXISTING` policy. 

#Remediation
Trigger level batching is still done in a blocking way but multiple cron 
triggers may be bundled together to share the same write transaction. Any 
followups, however, are performed in a non-blocking way by relying on a 
`BatchWorker.executeWithReplay()` and the `BatchWorkCompleted` notification. In 
order to still ensure non-concurrent execution of a given job key trigger, a 
token (job key) is saved within the trigger itself. A concurrent trigger will 
bail if a kill followup is still in progress (token is set AND no entry in 
`killFollowups` set exists yet).

#Results
The above approach allowed reducing the number of cron threads to 10 and likely 
can be reduced even further. See https://reviews.apache.org/r/51759 for the 
lock contention results.


Diffs
-

  commons/src/main/java/org/apache/aurora/common/util/BackoffHelper.java 
8e73dd9ebc43e06f696bbdac4d658e4b225e7df7 
  commons/src/test/java/org/apache/aurora/common/util/BackoffHelperTest.java 
bc30990d57f444f7d64805ed85c363f1302736d0 
  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
c07551e94f9221b5b21c5dc9715e82caa290c2e8 
  src/main/java/org/apache/aurora/scheduler/cron/quartz/CronModule.java 
155d702d68367b247dd066f773c662407f0e3b5b 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 
5c64ff2994e200b3453603ac5470e8e152cebc55 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
1c0a3fa84874d7bc185b78f13d2664cb4d8dd72f 

Diff: https://reviews.apache.org/r/51763/diff/


Testing
---

All types of testing including deploying to test and production clusters.


Thanks,

Maxim Khutornenko



Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148349
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 5:29 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 9, 2016, 5:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (out of 3) patches intending to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates and etc. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects
  the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reducing the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task even processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. The lock contention latencies got down between 2x and 5x 
> depending on the cluster load. A number of other approaches tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch but will become obvious in the part 2 (coming out shortly).  
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-09 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

This is the first (out of 3) patches intending to reduce storage write lock 
contention and as such improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave as `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates and etc. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. The task 
scheduling suffers the most as backing up the scheduling queue directly affects 
t
 he Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reducing the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
on particular workload types: task even processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics helping to monitor each workload type perf.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. The lock contention latencies got down between 2x and 5x 
depending on the cluster load. A number of other approaches tried but discarded 
as not performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
decreasing the lock contention first. 

Note: it wasn't easy to have a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch but will become obvious in the part 2 (coming out shortly).  

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/events/PubsubEvent.java 
70b5470b9dad1af838b5222cae5ac86487e2f2e4 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  

Re: Review Request 51758: Document how to generate a changelog

2016-09-09 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51758/#review148343
---


Ship it!




Ship It!

- Joshua Cohen


On Sept. 9, 2016, 1:40 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51758/
> ---
> 
> (Updated Sept. 9, 2016, 1:40 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Jake Farrell.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Let's pass on that druid knowledge from one release manager to the next.
> 
> 
> Diffs
> -
> 
>   docs/development/committers-guide.md 
> 14f3b52ed39f0e066986237d262ce4488491b3dc 
> 
> Diff: https://reviews.apache.org/r/51758/diff/
> 
> 
> Testing
> ---
> 
> Rendered version 
> https://github.com/StephanErb/aurora/blob/release-help/docs/development/committers-guide.md#creating-a-release
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 51758: Document how to generate a changelog

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51758/#review148334
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 1:40 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51758/
> ---
> 
> (Updated Sept. 9, 2016, 1:40 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Jake Farrell.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Let's pass on that druid knowledge from one release manager to the next.
> 
> 
> Diffs
> -
> 
>   docs/development/committers-guide.md 
> 14f3b52ed39f0e066986237d262ce4488491b3dc 
> 
> Diff: https://reviews.apache.org/r/51758/diff/
> 
> 
> Testing
> ---
> 
> Rendered version 
> https://github.com/StephanErb/aurora/blob/release-help/docs/development/committers-guide.md#creating-a-release
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Review Request 51758: Document how to generate a changelog

2016-09-09 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51758/
---

Review request for Aurora, Joshua Cohen and Jake Farrell.


Repository: aurora


Description
---

Let's pass on that druid knowledge from one release manager to the next.


Diffs
-

  docs/development/committers-guide.md 14f3b52ed39f0e066986237d262ce4488491b3dc 

Diff: https://reviews.apache.org/r/51758/diff/


Testing
---

Rendered version 
https://github.com/StephanErb/aurora/blob/release-help/docs/development/committers-guide.md#creating-a-release


Thanks,

Stephan Erb



Re: Review Request 51664: Document the Mesos containerizer

2016-09-09 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51664/
---

(Updated Sept. 9, 2016, 10:05 a.m.)


Review request for Aurora and Joshua Cohen.


Changes
---

Review changes


Bugs: AURORA-1640
https://issues.apache.org/jira/browse/AURORA-1640


Repository: aurora


Description
---

Included changes:

* enabled Docker support in our vagrant box
* consistent example jobs for both containerizers
* short enduser and operator  documentation
* shuffled the reference documentation so that it is clear certain limitations 
apply only to the Docker containerizer


Diffs (updated)
-

  docs/features/containers.md 6b9f717d672ef88be61341268f9c06923d217087 
  docs/operations/configuration.md 85787b0815e9f0cb37c21efc04923b0e0bd10bf9 
  docs/reference/configuration.md ff40262f1fb4eff780fbef17abcaecc457dc68d3 
  examples/jobs/docker/hello_docker.aurora 
d5611e6b78cb613c04e2fc3df54397ee4401a62e 
  examples/jobs/hello_docker_image.aurora PRE-CREATION 
  examples/vagrant/aurorabuild.sh 9daca067acc05c89ccc183e56bf30c787baf97dd 
  examples/vagrant/mesos_config/etc_mesos-slave/image_providers 
7168c0ca4a81e0a1ce2320534015c6692b804b54 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 
c36223028513ace26040620b813022accaf0edf3 

Diff: https://reviews.apache.org/r/51664/diff/


Testing
---

Rendered version available at 
https://github.com/StephanErb/aurora/tree/containerizer-docs/docs


Thanks,

Stephan Erb