Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Joshua Cohen


> On Sept. 15, 2016, 10:58 a.m., Joshua Cohen wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 166-167
> > 
> >
> > super nitpicky: mind swapping the order of these args to keep in line 
> > with `execute(Work work)`?
> 
> Maxim Khutornenko wrote:
> This ordering is deliberate to improve readability at the call site. The 
> verbose lambda comes last in `executeWithReplay`, which (subjectively) 
> makes it easier to read (https://reviews.apache.org/r/51763/). If you 
> disagree, I can reorder though.

Ahh, that makes sense. I'm fine with leaving it as is.


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149044
---


On Sept. 16, 2016, 8:33 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 16, 2016, 8:33 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149263
---


Ship it!




Master (783baae) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 16, 2016, 8:33 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 16, 2016, 8:33 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to get a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the 
> current patch; its purpose will become clear in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 16, 2016, 8:33 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Adding `NoResult` as a substitute for `Void` results.
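
For context, a minimal sketch of what such a marker type might look like 
(assumed shape, not the actual diff):

    // Assumed shape: a singleton marker so batch work items can signal
    // "no meaningful result" without resorting to Void and null.
    final class NoResult {
      static final NoResult INSTANCE = new NoResult();
      private NoResult() {}
    }

    // A side-effect-only work item then returns the singleton instead of null:
    // Work<NoResult> work = store -> { mutate(store); return NoResult.INSTANCE; };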


Repository: aurora


Description
---

This is the first of three patches intended to reduce storage write lock 
contention and thereby improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes caused by task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage writes effectively behave at the `SERIALIZABLE` 
transaction isolation level, in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While this certainly simplifies our implementation, it creates a single 
hotspot where multiple threads compete for storage write access. This type of 
contention only worsens as the cluster size grows: more tasks are scheduled, 
more status updates are processed, more subscribers are listening to status 
updates, and so on. Eventually, scheduler throughput (and especially task 
scheduling) degrades to the extent that certain operations wait much longer 
(4x and more) for lock acquisition than it takes to process their payload 
inside the transaction. Some ops (like event processing) are generally 
tolerant of such delays; others are not. Task scheduling suffers the most, as 
backing up the scheduling queue directly affects the Median Time To Assigned 
(MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single 
write transaction, thus reducing the number of potential write lock requests. 
To minimize wait-in-queue time, items are dispatched immediately and the max 
number of items per batch is bounded. There are a few `BatchWorker` instances 
specialized for particular workload types: task event processing, cron 
scheduling, and task scheduling. Every instance can be tuned independently 
(max batch size) and provides specialized metrics to help monitor each 
workload type's performance.
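
As a rough illustration of the mechanics (a minimal sketch with simplified, 
assumed names; the real `BatchWorker` also handles backoff, metrics, and 
service lifecycle):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the batching idea: a single dedicated thread drains a FIFO
    // queue into bounded batches, and each batch runs in one write
    // transaction. The Storage stand-in and all names are assumptions.
    final class BatchingSketch {
      interface Storage { void write(Runnable work); }  // one write transaction

      private static final int MAX_BATCH_SIZE = 32;  // tunable per instance
      private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
      private final Storage storage;

      BatchingSketch(Storage storage) { this.storage = storage; }

      CompletableFuture<Void> execute(Runnable work) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        queue.add(() -> {
          try {
            work.run();
            result.complete(null);
          } catch (RuntimeException e) {
            result.completeExceptionally(e);  // surface failures to the caller
          }
        });
        return result;
      }

      // Served non-stop by a single dedicated thread.
      void processingLoop() throws InterruptedException {
        while (true) {
          List<Runnable> batch = new ArrayList<>();
          batch.add(queue.take());                   // dispatch immediately...
          queue.drainTo(batch, MAX_BATCH_SIZE - 1);  // ...with a bounded batch size
          storage.write(() -> batch.forEach(Runnable::run));  // one shared transaction
        }
      }
    }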

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies went down by 2x to 5x, depending on 
the cluster load. A number of other approaches were tried but discarded as not 
performing well, or even performing much worse, than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It does, however, pave the way for the next phase of 
scheduling performance improvement: taking more than one task from a given 
`TaskGroup` in a single scheduling round (coming soon). That improvement 
wouldn't deliver without decreasing the lock contention first. 

Note: it wasn't easy to get a clean diff split, so some functionality in 
`BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the current 
patch; its purpose will become clear in part 2 (coming out shortly).

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Maxim Khutornenko


> On Sept. 15, 2016, 10:58 a.m., Joshua Cohen wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 166-167
> > 
> >
> > super nitpicky: mind swapping the order of these args to keep in line 
> > with `execute(Work work)`?

This ordering is deliberate to improve readability at the call site. The 
verbose lambda comes last in `executeWithReplay`, which (subjectively) makes 
it easier to read (https://reviews.apache.org/r/51763/). If you disagree, I 
can reorder though.
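
For illustration, with the verbose lambda last the call site reads top to 
bottom (shape assumed from the description, not the actual diff):

    // Assumed shape for illustration: short arguments first, Work lambda last.
    batchWorker.executeWithReplay(
        backoffStrategy,
        storeProvider -> {
          // ...multi-line write logic against storage...
          return BatchWorker.NO_RESULT;  // assumed constant from this patch
        });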


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149044
---


On Sept. 16, 2016, 8:03 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 16, 2016, 8:03 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch but 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149256
---


Ship it!




Ship It!

- Zameer Manji


On Sept. 16, 2016, 1:03 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 16, 2016, 1:03 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to get a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the 
> current patch; its purpose will become clear in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Maxim Khutornenko


> On Sept. 16, 2016, 6:18 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 213
> > 
> >
> > I think this is a misuse of `CompletableFuture`. I discovered this as I 
> > was reviewing https://reviews.apache.org/r/51765/.
> > 
> > I think we need to use 
> > https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html#completeExceptionally-java.lang.Throwable-
> > here and set the exception into the future.
> > 
> > Then the exception will propagate to the caller (like 
> > `TaskGroupBatchWorker`), and the user of the batch worker can handle it.

Good catch. This is a leftover from refactoring `AwaitableResult` into 
`CompletableFuture`, and it makes perfect sense to channel that error upstream.
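
Roughly, a sketch of the agreed direction (hypothetical names, not the 
actual diff):

    // Complete the item's future exceptionally instead of swallowing the
    // error, so the caller's future.get() rethrows it (wrapped in an
    // ExecutionException) and the call site can handle it.
    try {
      item.work.run();
      item.result.complete(null);
    } catch (RuntimeException e) {
      item.result.completeExceptionally(e);
    }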


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149227
---


On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-16 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149227
---




src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 213)


I think this is a misuse of `CompletableFuture`. I discovered this as I was 
reviewing https://reviews.apache.org/r/51765/.

I think we need to use 
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html#completeExceptionally-java.lang.Throwable-
here and set the exception into the future.

Then the exception will propagate to the caller (like 
`TaskGroupBatchWorker`), and the user of the batch worker can handle it.


- Zameer Manji


On Sept. 14, 2016, 3:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 3:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-15 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149044
---


Ship it!




Overall lgtm. Agree w/ Zameer that we should ship all three of these tickets 
together though.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 166 - 167)


super nitpicky: mind swapping the order of these args to keep in line with 
`execute(Work work)`?


- Joshua Cohen


On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to get a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the 
> current patch; its purpose will become clear in part 2 (coming out shortly).
> 
> [1] - 
> 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148994
---


Ship it!




Master (5069f93) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling performance improvement: taking more than one task from 
> a given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to get a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the 
> current patch; its purpose will become clear in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko


> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > This patch LGTM. Just a single concern here and some questions. I will be 
> > moving on to the subsequent patches shortly.
> > Please do not commit this until all parts are shipped. Also, please don't 
> > forget to rebase/update subsequent patches if changes are made here.
> > 
> > Also, does this cover all state change consumers that interact on storage 
> > when the event is fired or is this a subset? Just wondering if this is 
> > supposed to apply to all of them or just a small set that you think need 
> > this change.
> 
> Maxim Khutornenko wrote:
> > Also, does this cover all state change consumers that interact on 
> storage when the event is fired or is this a subset
> 
> This covers all known high-frequency consumers that require a write 
> transaction. The list was derived by running a stack-dumping version of 
> `CallOrderEnforcingStorage.write()` in a live cluster for a while and 
> analysing the top offenders. It may not be fully comprehensive, though, 
> so if you find any other candidates feel free to report them.
> 
> Zameer Manji wrote:
> I think some comments at the consumers noting that the batch worker is 
> required for high throughput would be greatly appreciated by future 
> contributors.

I'd rather avoid copy-pasting the same comment everywhere we use 
`BatchWorker`. Isn't it enough to trace the execution from the call site to 
the `BatchWorker` itself to see how/why it's used?


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148950
---


On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Zameer Manji


> On Sept. 14, 2016, 11:36 a.m., Zameer Manji wrote:
> > This patch LGTM. Just a single concern here and some questions. I will be 
> > moving on to the subsequent patches shortly.
> > Please do not commit this until all parts are shipped. Also, please don't 
> > forget to rebase/update subsequent patches if changes are made here.
> > 
> > Also, does this cover all state change consumers that interact on storage 
> > when the event is fired or is this a subset? Just wondering if this is 
> > supposed to apply to all of them or just a small set that you think need 
> > this change.
> 
> Maxim Khutornenko wrote:
> > Also, does this cover all state change consumers that interact on 
> storage when the event is fired or is this a subset
> 
> This covers all known high-frequency consumers that require a write 
> transaction. The list was derived by running a stack-dumping version of 
> `CallOrderEnforcingStorage.write()` in a live cluster for a while and 
> analysing the top offenders. It may not be fully comprehensive, though, 
> so if you find any other candidates feel free to report them.

I think some comments at the consumers noting that the batch worker is 
required for high throughput would be greatly appreciated by future 
contributors.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148950
---


On Sept. 14, 2016, 3:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 3:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage writes effectively behave at the `SERIALIZABLE` 
> transaction isolation level, in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While this certainly simplifies our implementation, it 
> creates a single hotspot where multiple threads compete for storage write 
> access. This type of contention only worsens as the cluster size grows: 
> more tasks are scheduled, more status updates are processed, more 
> subscribers are listening to status updates, and so on. Eventually, 
> scheduler throughput (and especially task scheduling) degrades to the 
> extent that certain operations wait much longer (4x and more) for lock 
> acquisition than it takes to process their payload inside the transaction. 
> Some ops (like event processing) are generally tolerant of such delays; 
> others are not. Task scheduling suffers the most, as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics to 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. Lock contention latencies went down by 2x to 5x, 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse, than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 14, 2016, 10:41 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Stephan suggested a small optimization for the `Work` offline. Also, dropping 
a flaky assert in `BatchWorkerTest` that apparently trips our Jenkins build.


Repository: aurora


Description
---

This is the first of three patches intended to reduce storage write lock 
contention and thereby improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes caused by task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage writes effectively behave at the `SERIALIZABLE` 
transaction isolation level, in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While this certainly simplifies our implementation, it creates a single 
hotspot where multiple threads compete for storage write access. This type of 
contention only worsens as the cluster size grows: more tasks are scheduled, 
more status updates are processed, more subscribers are listening to status 
updates, and so on. Eventually, scheduler throughput (and especially task 
scheduling) degrades to the extent that certain operations wait much longer 
(4x and more) for lock acquisition than it takes to process their payload 
inside the transaction. Some ops (like event processing) are generally 
tolerant of such delays; others are not. Task scheduling suffers the most, as 
backing up the scheduling queue directly affects the Median Time To Assigned 
(MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single 
write transaction, thus reducing the number of potential write lock requests. 
To minimize wait-in-queue time, items are dispatched immediately and the max 
number of items per batch is bounded. There are a few `BatchWorker` instances 
specialized for particular workload types: task event processing, cron 
scheduling, and task scheduling. Every instance can be tuned independently 
(max batch size) and provides specialized metrics to help monitor each 
workload type's performance.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies went down by 2x to 5x, depending on 
the cluster load. A number of other approaches were tried but discarded as not 
performing well, or even performing much worse, than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It does, however, pave the way for the next phase of 
scheduling perf improvement - taking more than 1 task from a given `TaskGroup` 
in a single scheduling round (coming soon). That improvement wouldn't deliver 
without first reducing the lock contention.

Note: it wasn't easy to get a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its use will become obvious in part 2 (coming out shortly).

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko


> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > This patch LGTM. Just a single concern here and some questions. I will be 
> > moving on to the subsequent patches shortly.
> > Please do not commit this until all parts are shipped. Also, please don't 
> > forget to rebase/update subsequent patches if changes are made here.
> > 
> > Also, does this cover all state change consumers that interact on storage 
> > when the event is fired or is this a subset? Just wondering if this is 
> > supposed to apply to all of them or just a small set that you think needs 
> > this change.

> Also, does this cover all state change consumers that interact on storage 
> when the event is fired or is this a subset

This covers all known high-frequency consumers that require a write 
transaction. The list was derived by running a stack-dumping version of 
`CallOrderEnforcingStorage.write()` in a live cluster for a while and analysing 
the top offenders. It may not be fully comprehensive though, so if you find any 
other candidates feel free to report them.
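
For context, a hypothetical version of that kind of instrumentation could look 
like the sketch below (the class, names and the frame filter are all 
assumptions; the actual code used in the cluster is not part of this patch):

    import java.util.Arrays;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Tallies write() callers by their first scheduler stack frame so the top
    // offenders can be read off after the scheduler has run for a while.
    final class WriteCallerStats {
      private static final Map<String, LongAdder> CALLERS = new ConcurrentHashMap<>();

      // Invoked from inside the storage write() entry point.
      static void recordCaller() {
        Arrays.stream(Thread.currentThread().getStackTrace())
            .map(StackTraceElement::toString)
            .filter(frame -> frame.startsWith("org.apache.aurora.scheduler"))
            .skip(1)  // skip the storage frame itself; keep its caller
            .findFirst()
            .ifPresent(f -> CALLERS.computeIfAbsent(f, k -> new LongAdder()).increment());
      }

      static Map<String, LongAdder> callers() { return CALLERS; }
    }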


> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 233
> > 
> >
> > Instead of having an `Optional` and then later 
> > checking it's present when we compute the backoff, we could simplify the 
> > internal implementation by having non-repeatable work items default to some 
> > sort of `NoopBackoffStrategy` instead.

I don't really see how that would be a better solution in this particular 
case. We do need a `BackoffStrategy` in the repeatable work case, and it had 
better be a real one (hard-failing if not provided). Having a noop strategy 
softens that constraint.
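
To make the trade-off concrete, a sketch of the `Optional`-based shape (field 
and method names are illustrative assumptions, not the patch's exact code):

    import java.util.Optional;

    interface BackoffStrategy {
      long calculateBackoffMs(long lastBackoffMs);
    }

    // A repeatable item must carry a real strategy and hard-fails otherwise;
    // a NoopBackoffStrategy default would silently soften that constraint.
    final class WorkItem {
      final boolean repeatable;
      final Optional<BackoffStrategy> backoff;

      WorkItem(boolean repeatable, Optional<BackoffStrategy> backoff) {
        if (repeatable && !backoff.isPresent()) {
          throw new IllegalArgumentException(
              "Repeatable work requires a BackoffStrategy");
        }
        this.repeatable = repeatable;
        this.backoff = backoff;
      }

      long nextBackoffMs(long lastBackoffMs) {
        return backoff.get().calculateBackoffMs(lastBackoffMs);  // repeatable only
      }
    }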


> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/SchedulerModule.java, line 69
> > 
> >
> > Just curious, is this default arbitrary or something that was derived 
> > from your deployment in production?

It's semi-arbitrary but proven in production :).


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148950
---


On Sept. 14, 2016, 5:25 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 5:25 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148950
---


Fix it, then Ship it!




This patch LGTM. Just a single concern here and some questions. I will be 
moving on to the subsequent patches shortly.
Please do not commit this until all parts are shipped. Also, please don't 
forget to rebase/update subsequent patches if changes are made here.

Also, does this cover all state change consumers that interact on storage when 
the event is fired or is this a subset? Just wondering if this is supposed to 
apply to all of them or just a small set that you think needs this change.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 233)


Instead of having an `Optional` and then later checking 
it's present when we compute the backoff, we could simplify the internal 
implementation by having non-repeatable work items default to some sort of 
`NoopBackoffStrategy` instead.



src/main/java/org/apache/aurora/scheduler/SchedulerModule.java (line 69)


Just curious, is this default arbitrary or something that was derived from 
your deployment in production?


- Zameer Manji


On Sept. 14, 2016, 10:25 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Zameer Manji


> On Sept. 13, 2016, 6:29 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 110
> > 
> >
> > The documentation implies we are returning a boolean but we are 
> > returning a `Result` object. I think you need to update that. I think what 
> > you are saying is that the `Result` object signals through `isCompleted` 
> > that the work should be repeated, right?
> > 
> > (As an aside, maybe we should rename this to `shouldReschedule` or 
> > similar).
> > 
> > 
> > If work needs to be repeated once the result has been computed, why 
> > can't the caller just access the `BatchWorker` in the `Work` it submits 
> > and do that itself?
> > 
> > That is, why can't we extend `apply` to also accept a `BatchWorker`, and 
> > the caller can submit another task within the `Work` if that's required?
> > 
> > Alternatively now that `execute` returns a `CompletableFuture`, the 
> > caller could use one of the many methods to execute code when the future is 
> > complete and have that extra code add more work.
> 
> Maxim Khutornenko wrote:
> Fixed the javadoc and converted `NoResult` into a static field to 
> simplify callsite consumption.
> 
> As for the second suggestion, I don't like disseminating `BatchWorkers` 
> or work item state maintenance outside the `BatchWorker` itself. That would 
> put the burden of maintaining and passing around the state to the callsite 
> and may create hard-to-reason-about/troubleshoot use cases. Consider 
> https://reviews.apache.org/r/51763. We would have to use an explicit wrapper 
> object to hold on to local data to be able to re-queue it again as the 
> trigger context is gone the moment the `BatchWorker` call is placed. Also, 
> queueing recurrent homogeneous events via `CompletionStage` chaining does 
> not strike me as the right use of the functional style.

Understood.

I spent some time thinking about an alternative and didn't come up with one 
that seemed to work well, so I'll accept this design. Most solutions I looked 
at involved recursive closures. They would look like `queue work` -> `add a 
callback on the future` -> `the callback calls back into queueing work`. I 
think that's pretty messy and something that could be touched upon later.
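
For reference, the recursive-closure shape described above would look roughly 
like this (a self-contained sketch with an assumed `execute` signature, not 
code from the patch):

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Supplier;

    final class RequeueExample {
      // Stand-in for BatchWorker.execute(): runs the work asynchronously and
      // completes the returned future with the work's "am I done?" answer.
      static CompletableFuture<Boolean> execute(Supplier<Boolean> work) {
        return CompletableFuture.supplyAsync(work);
      }

      // Queue work -> add a callback on the future -> the callback calls back
      // into queueing more work.
      static void requeueUntilDone(Supplier<Boolean> work) {
        execute(work).thenAccept(done -> {
          if (!done) {
            requeueUntilDone(work);
          }
        });
      }
    }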


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148837
---


On Sept. 14, 2016, 10:25 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148936
---



Master (5069f93) is red with this patch.
  ./build-support/jenkins/build.sh


:processTestResources
:testClasses
:checkstyleTest
:findbugsJmh
:findbugsMain
:findbugsTest
:licenseJmh UP-TO-DATE
:licenseMain UP-TO-DATE
:licenseTest UP-TO-DATE
:license UP-TO-DATE
:pmdJmh
:pmdMain
:pmdTest
:test

org.apache.aurora.scheduler.BatchWorkerTest > testExecute FAILED
java.lang.AssertionError at BatchWorkerTest.java:64
I0914 17:38:20.013 [ShutdownHook, SchedulerMain:101] Stopping scheduler 
services. 

1041 tests completed, 1 failed, 2 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:analyzeReport
Instruction coverage is 0.8872853310356982, but must be greater than 0.89
Branch coverage is 0.7994338859684593, but must be greater than 0.835

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

BUILD FAILED

Total time: 8 mins 12.185 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 5:25 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 5:25 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 14, 2016, 5:25 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Renaming `readyItems` to `batch`.


Repository: aurora


Description
---

This is the first (of 3) patches intended to reduce storage write lock 
contention and, as such, improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave at the `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates, and so on. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. Task 
scheduling suffers the most, as backing up the scheduling queue directly 
affects the Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
for particular workload types: task event processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics helping to monitor each workload type perf.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies went down by 2x to 5x 
depending on the cluster load. A number of other approaches were tried but discarded 
as not performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
first reducing the lock contention.

Note: it wasn't easy to get a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its use will become obvious in part 2 (coming out shortly).

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148931
---


Ship it!




Master (5069f93) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 4:47 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 4:47 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without first reducing the lock contention. 
> 
> Note: it wasn't easy to get a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch; its use will become obvious in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko


> On Sept. 14, 2016, 4:56 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java, 
> > lines 76-90
> > 
> >
> > This code exposes a slight design smell to my eye: we are using the 
> > `DelayedExecutor` to trigger the `batchWorker` when a backoff timer has 
> > passed. However, the latter already provides an internal backoff mechanism.
> > 
> > I'd rather see us have just one defined way to handle those 
> > time-based backoffs:
> > 
> > * either the BatchWorker does not perform any backoff handling at all. 
> > This seems to be in line with what Zameer has proposed in his last comment,
> > * or we make the backoff mechanism a first-class feature of the 
> > batchWorker so that, say, I can inject `backoffStrategy` and 
> > `lastBackoffMsec` to be used for the `WorkItem` created in 
> > `batchWorker.execute`.
> 
> Stephan Erb wrote:
> The same applies to `startGroup(final TaskGroup group)` in Part 3: a 
> backoff mechanism is used to call the batchWorker, which could provide the 
> backoff on its own.

The `BatchWorker` currently supports only 2 use cases:
1. Simple no-repeat work item execution (`BatchWorker.execute()`)
2. Repeatable work item execution (`BatchWorker.executeWithReplay()`)

I don't see why `BatchWorker` has to use `BackoffStrategy` everywhere or not 
use it at all. There is a clear use case for it in #2 and absolutely no purpose 
in #1.

What you are suggesting adds yet another case of something like 
`BatchWorker.scheduleExecution()`, which goes beyond what's immediately 
necessary to address the above-cited goals. I don't mind considering further 
improvements to the `BatchWorker` outside of this patch's scope, but I'd like 
to keep it concise to avoid feature creep.
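
For reference, the two supported entry points have roughly these shapes 
(signatures paraphrased from the discussion, not copied from the patch):

    import java.util.concurrent.CompletableFuture;

    // A sketch of the two BatchWorker use cases with simplified signatures.
    interface BatchWorkerSketch {
      interface Work<T> { T apply(); }
      interface Result<T> { boolean isCompleted(); T value(); }
      interface RepeatableWork<T> { Result<T> apply(); }
      interface BackoffStrategy { long calculateBackoffMs(long lastBackoffMs); }

      // Use case 1: simple no-repeat work item execution.
      <T> CompletableFuture<T> execute(Work<T> work);

      // Use case 2: repeatable execution; the item is re-queued, with backoff,
      // until its Result reports completion.
      <T> CompletableFuture<T> executeWithReplay(
          BackoffStrategy backoffStrategy,
          RepeatableWork<T> work);
    }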


> On Sept. 14, 2016, 4:56 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 228
> > 
> >
> > Do you have any particular use case for this metric? 
> > 
> > We already have `batchesProcessed` and `itemsProcessed` and can compute 
> > the rate of processed items and batches. This seems more robust to me than 
> > just looking at `lastBatchSize`.

It's more of a convenience metric, similar to how `TimedInterceptor` exposes 
both `total_nanos` and `total_events` in addition to `events_per_sec`. If you 
feel strongly about removing redundancy here, I am OK dropping it. Just thought 
having a single metric to track the effective batch size may be helpful.


> On Sept. 14, 2016, 4:56 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 197
> > 
> >
> > `readyItems` is still a left over from the previous patch version. 
> > Could be renamed to just `batch`.

Thanks. Fixed.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148918
---


On Sept. 14, 2016, 4:47 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 4:47 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Stephan Erb


> On Sept. 14, 2016, 6:56 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java, 
> > lines 76-90
> > 
> >
> > This code exposes a slight design smell to my eye: we are using the 
> > `DelayedExecutor` to trigger the `batchWorker` when a backoff timer has 
> > passed. However, the latter already provides an internal backoff mechanism.
> > 
> > I'd rather see us have just one defined way to handle those 
> > time-based backoffs:
> > 
> > * either the BatchWorker does not perform any backoff handling at all. 
> > This seems to be in line with what Zameer has proposed in his last comment,
> > * or we make the backoff mechanism a first-class feature of the 
> > batchWorker so that, say, I can inject `backoffStrategy` and 
> > `lastBackoffMsec` to be used for the `WorkItem` created in 
> > `batchWorker.execute`.

The same applies to `startGroup(final TaskGroup group)` in Part 3: a backoff 
mechanism is used to call the batchWorker, which could provide the backoff on 
its own.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148918
---


On Sept. 14, 2016, 6:47 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 6:47 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148918
---


Fix it, then Ship it!




LGTM.

A couple of smaller points below, with the largest being the somewhat blurry 
handling of time-based executions.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 197)


`readyItems` is still a left over from the previous patch version. Could be 
renamed to just `batch`.



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 228)


Do you have any particular use case for this metric? 

We already have `batchesProcessed` and `itemsProcessed` and can compute the 
rate of processed items and batches. This seems more robust to me than just 
looking at `lastBatchSize`.



src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java (lines 
73 - 87)


This code exposes a slight design smell to my eye: we are using the 
`DelayedExecutor` to trigger the `batchWorker` when a backoff timer has passed. 
However, the latter already provides an internal backoff mechanism.

I'd rather see us have just one defined way to handle those 
time-based backoffs:

* either the BatchWorker does not perform any backoff handling at all. This 
seems to be in line with what Zameer has proposed in his last comment,
* or we make the backoff mechanism a first-class feature of the batchWorker 
so that, say, I can inject `backoffStrategy` and `lastBackoffMsec` to be 
used for the `WorkItem` created in `batchWorker.execute`.
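
A sketch of what that first-class shape might look like (purely illustrative 
of the suggestion above; this is not an existing API):

    // Backoff as an explicit parameter of execute(), instead of a separate
    // DelayedExecutor at the call site (hypothetical, per the suggestion).
    interface BatchWorkerWithBackoff {
      interface Work { void apply(); }
      interface BackoffStrategy { long calculateBackoffMs(long lastBackoffMs); }

      // The WorkItem built internally would carry the backoffStrategy and
      // lastBackoffMsec supplied by the caller.
      void execute(Work work, BackoffStrategy backoffStrategy, long lastBackoffMsec);
    }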


- Stephan Erb


On Sept. 14, 2016, 6:47 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 6:47 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 14, 2016, 4:47 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Javadoc changes and minor refactoring.


Repository: aurora


Description
---

This is the first (of 3) patches intended to reduce storage write lock 
contention and, as such, improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave at the `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates, and so on. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. Task 
scheduling suffers the most, as backing up the scheduling queue directly 
affects the Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
for particular workload types: task event processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics helping to monitor each workload type perf.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies went down by 2x to 5x 
depending on the cluster load. A number of other approaches were tried but discarded 
as not performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
first reducing the lock contention.

Note: it wasn't easy to get a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its use will become obvious in part 2 (coming out shortly).

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-14 Thread Maxim Khutornenko


> On Sept. 14, 2016, 1:29 a.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 110
> > 
> >
> > The documentation implies we are returning a boolean but we are 
> > returning a `Result` object. I think you need to update that. I think what 
> > you are saying is that the `Result` object signals through `isCompleted` 
> > that the work should be repeated, right?
> > 
> > (As an aside, maybe we should rename this to `shouldReschedule` or 
> > similar).
> > 
> > 
> > If work needs to be repeated once the result has been computed, why 
> > can't the caller just access the `BatchWorker` in the `Work` it submits 
> > and do that itself?
> > 
> > That is, why can't we extend `apply` to also accept a `BatchWorker`, and 
> > the caller can submit another task within the `Work` if that's required?
> > 
> > Alternatively now that `execute` returns a `CompletableFuture`, the 
> > caller could use one of the many methods to execute code when the future is 
> > complete and have that extra code add more work.

Fixed the javadoc and converted `NoResult` into a static field to simplify 
callsite consumption.

As for the second suggestion, I don't like disseminating `BatchWorkers` or work 
item state maintenance outside the `BatchWorker` itself. That would put the 
burden of maintaining and passing around the state to the callsite and may 
create hard-to-reason-about/troubleshoot use cases. Consider 
https://reviews.apache.org/r/51763. We would have to use an explicit wrapper 
object to hold on to local data to be able to re-queue it again as the trigger 
context is gone the moment the `BatchWorker` call is placed. Also, queueing 
recurrent homogeneous events via `CompletionStage` chaining does not strike me 
as the right use of the functional style.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148837
---


On Sept. 14, 2016, 12:16 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 12:16 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized for particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148837
---




src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 110)


The documentation implies we are returning a boolean but we are returning a 
`Result` object. I think you need to update that. I think what you are saying 
is that the `Result` object signals through `isCompleted` that the work should 
be repeated, right?

(As an aside, maybe we should rename this to `shouldReschedule` or similar).

If work needs to be repeated once the result has been computed, why can't 
the caller just access the `BatchWorker` in the `Work` it submits and do 
that itself?

That is, why can't we extend `apply` to also accept a `BatchWorker`, and the 
caller can submit another task within the `Work` if that's required?

Alternatively now that `execute` returns a `CompletableFuture`, the caller 
could use one of the many methods to execute code when the future is complete 
and have that extra code add more work.


- Zameer Manji


On Sept. 13, 2016, 5:16 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 13, 2016, 5:16 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and, as such, improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x depending on 
> the cluster load. A number of other approaches were tried but discarded as 
> not performing well or even performing much worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148823
---


Ship it!




Master (5069f93) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 12:16 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 12:16 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. The lock contention latencies got down between 2x and 5x 
> depending on the cluster load. A number of other approaches tried but 
> discarded as not performing well or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch; its purpose will become obvious in part 2 (coming out shortly).  
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 14, 2016, 12:16 a.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Stephan's comments (getting rid of getReadyItems()).


Repository: aurora


Description
---

This is the first of 3 patches intended to reduce storage write lock 
contention and as such improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave as `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates, and so on. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. The task 
scheduling suffers the most as backing up the scheduling queue directly 
affects the Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
on particular workload types: task event processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics that help monitor the performance of each 
workload type.
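
For illustration only, a minimal sketch of this batching loop under simplifying 
assumptions (a toy stand-in, not the actual `BatchWorker` code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a dedicated thread blocks for the next work item, drains whatever
// else is already queued (bounded by maxBatchSize), and runs the whole batch
// inside a single (stubbed) write transaction, i.e. one lock acquisition per
// batch instead of one per item.
class BatchingSketch implements Runnable {
  private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();
  private final int maxBatchSize;

  BatchingSketch(int maxBatchSize) {
    this.maxBatchSize = maxBatchSize;
  }

  void execute(Runnable work) {
    workQueue.add(work);
  }

  @Override
  public void run() {
    try {
      while (true) {
        List<Runnable> batch = new ArrayList<>();
        batch.add(workQueue.take()); // Dispatch immediately: wait for one item...
        workQueue.drainTo(batch, maxBatchSize - 1); // ...then grab what's waiting.
        inWriteTransaction(batch);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Stand-in for a storage write transaction shared by the whole batch.
  private synchronized void inWriteTransaction(List<Runnable> batch) {
    batch.forEach(Runnable::run);
  }
}
```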

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies dropped by 2x to 5x depending on the 
cluster load. A number of other approaches were tried but discarded as not 
performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
decreasing the lock contention first. 

Note: it wasn't easy to have a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its purpose will become obvious in part 2 (coming out shortly).  

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Maxim Khutornenko


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 257-259
> > 
> >
> > 1) Doesn't that effectively result in a busy loop where we are queuing 
> > the same item over and over until it can finally be executed? This 
> > sounds rather problematic.
> > 
> > 2) This changes the order of work items. Are you sure that no storage 
> > client requires writes to be in order?
> 
> Maxim Khutornenko wrote:
> > 1) Doesn't that effectively result in a busy loop where we are queuing 
> the same item over and over until it can finally be executed? This sounds 
> rather problematic.
> 
> There is no busy loop in a pure sense as every `RepeatableWork` item must 
> have a backoff strategy associated with it. Any item scheduled for 
> re-processing is delayed according to the supplied backoff.
> 
> > 2) This changes the order of work items. Are you sure that no storage 
> client requires writes to be in order?
> 
> Order has never been implied or expected as far as cross-transaction 
> boundaries go. That said, the order is preserved for a given work type. We 
> re-queue incomplete items in the same order they were received.
> 
> Stephan Erb wrote:
> Regarding question 1)
> 
> It seems to me that the `run()` loop will drain a batch of WorkItems from 
> the `workQueue`. We then call `getReadyItems` and re-queue everything that is 
> currently in backoff. This repeats over and over again until the backoff time 
> has passed. 
> 
> Have you considered using a `@AsyncExecutor DelayExecutor` just like in 
> `TaskThrottler` for everything that is in backoff and cannot be executed 
> immediately?

In a real cluster, there is no shortage of items pending evaluation. As such, 
the FOLLOWUP item "tax" is almost never a problem as BatchWorker will always be 
in a "busy loop" processing new and old items. I can see how this can become 
too wasteful in a really small cluster with not too many repeatable work items 
though.

I just realized how I can improve on the above and simplify the work queue 
processing. Instead of relying on a queue to drive backoff, we can use backoff 
as a delay on re-queueing incomplete items. This will let us get rid of the 
`getReadyItems()` entirely. Thanks for bringing this up!
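
For illustration, a minimal sketch of that re-queueing idea, assuming a plain 
`ScheduledExecutorService` in place of Aurora's `DelayExecutor` (names are 
illustrative, not the patch code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: instead of draining and filtering items still in backoff
// (getReadyItems()), an incomplete item re-enters the work queue only after
// its backoff delay elapses, so the worker loop never spins on it.
class RequeueSketch {
  private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();
  private final ScheduledExecutorService delayExecutor =
      Executors.newSingleThreadScheduledExecutor();

  void requeueAfterBackoff(Runnable incompleteItem, long backoffMsec) {
    delayExecutor.schedule(
        () -> workQueue.add(incompleteItem), backoffMsec, TimeUnit.MILLISECONDS);
  }
}
```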


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148433
---


On Sept. 13, 2016, 11:14 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 13, 2016, 11:14 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148806
---


Ship it!




Master (633948a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 13, 2016, 11:14 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 13, 2016, 11:14 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x depending on 
> the cluster load. A number of other approaches were tried but discarded as 
> not performing well or even performing much worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch; its purpose will become obvious in part 2 (coming out shortly).  
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 13, 2016, 11:14 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Zameer's comments (Switching from custom result type to CompletableFuture).


Repository: aurora


Description
---

This is the first of 3 patches intended to reduce storage write lock 
contention and as such improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave as `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates, and so on. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. The task 
scheduling suffers the most as backing up the scheduling queue directly 
affects the Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
on particular workload types: task event processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics that help monitor the performance of each 
workload type.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies dropped by 2x to 5x depending on the 
cluster load. A number of other approaches were tried but discarded as not 
performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
decreasing the lock contention first. 

Note: it wasn't easy to have a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its purpose will become obvious in part 2 (coming out shortly).  

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 
594bb6219294dcc77d48dcad14e2a6f9caa0c534 
  

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-13 Thread Stephan Erb


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > Thanks for the excellent write up and the time you took to split the review 
> > into sizable chunks. A full review will take some time. I will start with a 
> > couple of highlevel questions.
> > 
> > a) Do you have some results of the JMH benchmarks that you might share?
> > 
> > b) If I understand this correctly, you are basically trading one lock with 
> > another: Instead of waiting on the global storage lock, threads do now have 
> > to wait on the lock of the `workQueue`. Obviously waiting on that lock is 
> > much faster as there is less work being done in the critical region 
> > protected by that lock. However, isn't that a strategy that we could try to 
> > use as well? For details, see my comment in Part 3.
> 
> Maxim Khutornenko wrote:
> > a) Do you have some results of the JMH benchmarks that you might share?
> 
> Good question. Creating a reliable benchmark to simulate real-world 
> lock contention is next to impossible. JMH guidelines warn that the accuracy 
> of benchmarking degrades quickly with the number of threads involved. The best 
> we could prove here is that the overhead introduced by the `BatchWorker` is 
> not prohibitive:
> 
> Master:
> ```
> Benchmark                                                                      Mode  Cnt     Score    Error  Units
> SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2714.177 ± 84.900  ops/s
> ```
> 
> Part 3 (https://reviews.apache.org/r/51765/):
> ```
> Benchmark                                                                      Mode  Cnt     Score    Error  Units
> SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2664.389 ± 65.013  ops/s
> ```
> 
> The results are well within the margin of error and both are a few orders 
> of magnitude higher than the real scheduling perf possible/observed.
> 
> > b) If I understand this correctly, you are basically trading one lock 
> with another...
> 
> The major benefit here is improving the response time by reducing the 
> hotspot contention. According to Little's Law [1]:
> ```
> MeanResponseTime = MeanNumberInSystem / MeanThroughput
> ```
> 
> Given the fixed throughput (native log perf), the only way to reduce 
> queue saturation (and as such improve response time) is to reduce the number 
> of the outstanding requests. Batching multiple individual requests 
> accomplishes exactly that goal. It's worth noting that the above formula is 
> applicable when all requests take about the same time to process. That is 
> *not* the case in our system. Some take longer (task scheduling), others are 
> very low latency (single task status processing). Varying the batch size of 
> different types of requests (a few scheduling requests vs. hundreds of task 
> events) lets us level out the request shape (as much as it's physically 
> possible) and as such improve overall perf.
> 
> As for reducing the critical section weight, I'll address it in the Part 
> 3 comment. In general though, reducing the latency helps to level out the 
> request shape but is still not sufficient to reduce the hotspot (see above).
> 
> [1] - https://en.wikipedia.org/wiki/Little%27s_law

Oh, I wasn't aware that the native log limits your scheduling throughput. My 
uninformed guess was that you'd spend the most time looping over those O(1) 
host offers trying to find a suitable one, rather than appending to the native 
log.

Batching requests then makes sense to me.


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 58
> > 
> >
> > I am wondering if this queue shouldn't have a fixed capacity, so that 
> > we gain some kind of back pressure throughout the system.
> 
> Maxim Khutornenko wrote:
> Not sure I follow. The default `LinkedBlockingQueue` constructor uses 
> `Integer.MAX_VALUE` for the max number of items in the queue. That should be 
> more than sufficient for our use case.

Scratch that comment of mine. I feared that this queue could grow without 
bounds, but this is probably nothing to be concerned about.


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 240-247
> > 
> >
> > A queue could hold vital information that we should not loose during 
> > scheduler failover (in particular, as we do this once per day).
> > 
> > Would it be possible to drain the queue completely before exiting here?
> 
> Maxim Khutornenko wrote:
> This is no different than terminating the main process while threads are 
> waiting on a write lock. Transaction 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Maxim Khutornenko


> On Sept. 13, 2016, 1:14 a.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 188
> > 
> >
> > Our current metrics use 'nanos' to indicate nanoseconds instead of 
> > 'ns'. Mind using that here for consistency?

We actually have both "nanos" and "ns". I agree that "nanos" looks more 
prevalent as it's what `TimedInterceptor` is using.


> On Sept. 13, 2016, 1:14 a.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 105
> > 
> >
> > I agree we cannot use `java.util.concurrent.FutureTask` here. However, 
> > this seems to be pretty close to 
> > `java.util.concurrent.CompletableFuture` (new in Java 8).
> > 
> > It's not clear to me here if you have investigated it or not, but if 
> > you have I would like a brief explanation as to why it is not suitable. If 
> > you haven't, please take a look at it as it seems to provide all of the 
> > functionality you need.
> > 
> > Further, if it is possible to use it, then we do not need pubsub for 
> > notification for when work completes; if each item is its own future, we can 
> > use the `CompletionStage` functionality to execute code after a future has 
> > been fulfilled. 
> > 
> > It has a `get()` method that can replace `waitForResult()` and a 
> > `complete()` method that allows you to set the value.

I did evaluate the `CompletableFuture` for the initial version of the 
`BatchWorker` and did not find it up to the task. Looking at it again though, I 
don't see any reasons why it would not satisfy the current version! Thanks for 
the nudge, the diff is incoming.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148633
---


On Sept. 12, 2016, 10:51 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 12, 2016, 10:51 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x 
> depending on the cluster load. A 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148633
---



I noticed that `CompletableFuture` might be suitable here. I will hold off on a 
deeper review until I'm convinced that it is not suitable. I feel that if it 
were possible to adopt it here, it could simplify the design.

Overall, I like the approach of queueing up multiple writes together and then 
executing them together inside the same storage lock. I think the interface is 
suitable for this purpose.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 105)


I agree we cannot use `java.util.concurrent.FutureTask` here. However, this 
seems to be pretty close to `java.util.concurrent.CompletableFuture` (new in 
Java 8).

It's not clear to me here if you have investigated it or not, but if you 
have I would like a brief explanation as to why it is not suitable. If you 
haven't, please take a look at it as it seems to provide all of the 
functionality you need.

Further, if it is possible to use it, then we do not need pubsub for 
notification for when work completes; if each item is its own future, we can use 
the `CompletionStage` functionality to execute code after a future has been 
fulfilled. 

It has a `get()` method that can replace `waitForResult()` and a 
`complete()` method that allows you to set the value.
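
For example, a minimal self-contained sketch of that usage (illustrative names, 
not the patch code):

```java
import java.util.concurrent.CompletableFuture;

// The CompletableFuture itself plays the role of the custom result type.
class CompletableFutureSketch {
  public static void main(String[] args) {
    CompletableFuture<String> result = new CompletableFuture<>();

    // Producer side (e.g. the batch-processing thread): complete() sets the
    // value, replacing a custom "set result" call.
    new Thread(() -> result.complete("work done")).start();

    // Consumer side: join()/get() blocks until the value is set, replacing
    // waitForResult().
    System.out.println(result.join());
  }
}
```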



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 188)


Our current metrics use 'nanos' to indicate nanoseconds instead of 'ns'. 
Mind using that here for consistency?


- Zameer Manji


On Sept. 12, 2016, 3:51 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 12, 2016, 3:51 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x depending on 
> the cluster load. A number of other approaches were tried but discarded as 
> not performing well or even performing much worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148625
---


Ship it!




Master (c4903d8) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 12, 2016, 10:51 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 12, 2016, 10:51 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x depending on 
> the cluster load. A number of other approaches were tried but discarded as 
> not performing well or even performing much worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch; its purpose will become obvious in part 2 (coming out shortly).  
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/
---

(Updated Sept. 12, 2016, 10:51 p.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Joshua's comments.


Repository: aurora


Description
---

This is the first of 3 patches intended to reduce storage write lock 
contention and as such improve overall system write throughput. It introduces 
the `BatchWorker` and migrates the majority of storage writes due to task 
status change events to use `TaskEventBatchWorker`.

#Problem
Our current storage system writes effectively behave as `SERIALIZABLE` 
transaction isolation level in SQL terms. This means all writes require 
exclusive access to the storage and no two transactions can happen in parallel 
[1]. While it certainly simplifies our implementation, it creates a single 
hotspot where multiple threads are competing for the storage write access. This 
type of contention only worsens as the cluster size grows, more tasks are 
scheduled, more status updates are processed, more subscribers are listening to 
status updates, and so on. Eventually, the scheduler throughput (and especially 
task scheduling) becomes degraded to the extent that certain operations wait 
much longer (4x and more) for the lock acquisition than it takes to process 
their payload when inside the transaction. Some ops (like event processing) are 
generally tolerant of these types of delays. Others - not as much. The task 
scheduling suffers the most as backing up the scheduling queue directly 
affects the Median Time To Assigned (MTTA).

#Remediation
Given the above, it's natural to assume that reducing the number of write 
transactions should help reduce the lock contention. This patch introduces a 
generic `BatchWorker` service that delivers a "best effort" batching approach 
by redirecting multiple individual write requests into a single FIFO queue 
served non-stop by a single dedicated thread. Every batch shares a single write 
transaction thus reducing the number of potential write lock requests. To 
minimize wait-in-queue time, items are dispatched immediately and the max 
number of items is bounded. There are a few `BatchWorker` instances specialized 
on particular workload types: task event processing, cron scheduling and task 
scheduling. Every instance can be tuned independently (max batch size) and 
provides specialized metrics that help monitor the performance of each 
workload type.

#Results
The proposed approach has been heavily tested in production and delivered the 
best results. Lock contention latencies dropped by 2x to 5x depending on the 
cluster load. A number of other approaches were tried but discarded as not 
performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR 
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using 
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call 
sites are notified of item completion via a `BatchWorkCompleted` event

Happy to provide more details on the above if interested.

#Upcoming
The introduction of the `BatchWorker` by itself was not enough to substantially 
improve the MTTA. It, however, paves the way for the next phase of scheduling 
perf improvement - taking more than 1 task from a given `TaskGroup` in a single 
scheduling round (coming soon). That improvement wouldn't deliver without 
decreasing the lock contention first. 

Note: it wasn't easy to have a clean diff split, so some functionality in 
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
patch; its purpose will become obvious in part 2 (coming out shortly).  

[1] - 
https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
  src/main/java/org/apache/aurora/scheduler/events/PubsubEvent.java 
70b5470b9dad1af838b5222cae5ac86487e2f2e4 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
  src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
  
src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Maxim Khutornenko


> On Sept. 12, 2016, 6:26 p.m., Joshua Cohen wrote:
> > Along the lines of the question Stephan raised, what happens in the event 
> > of a failover mid-batch, especially w.r.t. repeatable work?

Same as today: the transaction either happens or does not. In case of a cron 
job (the only user of `RepeatableWork` at the moment), the cron job will not be 
triggered (again, same as today).


> On Sept. 12, 2016, 6:26 p.m., Joshua Cohen wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 106-107
> > 
> >
> > I think this should be "does not initialize"?

Good catch. Fixed.


> On Sept. 12, 2016, 6:26 p.m., Joshua Cohen wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 257-259
> > 
> >
> > Even though there's only a single callsite where we initialize a 
> > `WorkItem` in `FOLLOWUP` state, there's no guarantee that `lastBackoffMsec` 
> > will be present just because the state is `FOLLOWUP`.
> > 
> > We should add some state checks to the `WorkItem` constructor so that 
> > if state is `FOLLOWUP` we'll throw if `lastBackoffMsec` is not present to 
> > protect against future mistakes? (Or if that's a valid scenario we should 
> > perform an `isPresent` check or use `or` here...).

I assert both `backoffStrategy` and `lastBackoffMsec` inside the `backoffFor()` 
method on line 315. I think it should be sufficient given that `WorkItem` is a 
private class?
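
For reference, a rough sketch of what such a constructor check could look like 
(the class, field, and enum names here mirror the discussion, not the actual 
patch):

```java
import java.util.Optional;

// Sketch: fail fast at construction time instead of inside backoffFor().
final class WorkItem {
  enum Status { RUN, FOLLOWUP }

  private final Status status;
  private final Optional<Long> lastBackoffMsec;

  WorkItem(Status status, Optional<Long> lastBackoffMsec) {
    if (status == Status.FOLLOWUP && !lastBackoffMsec.isPresent()) {
      throw new IllegalStateException("FOLLOWUP items must carry lastBackoffMsec");
    }
    this.status = status;
    this.lastBackoffMsec = lastBackoffMsec;
  }
}
```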


> On Sept. 12, 2016, 6:26 p.m., Joshua Cohen wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 273-275
> > 
> >
> > Should we do this after the batch is processed rather than before?

It's largely irrelevant but I guess it's more logical to expect it at the end 
while going through the code. Moved.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148501
---


On Sept. 9, 2016, 5:29 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 9, 2016, 5:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first of 3 patches intended to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics that help 
> monitor the performance of each workload type.
> 
> 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Maxim Khutornenko


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > Thanks for the excellent write up and the time you took to split the review 
> > into sizable chunks. A full review will take some time. I will start with a 
> > couple of highlevel questions.
> > 
> > a) Do you have some results of the JMH benchmarks that you might share?
> > 
> > b) If I understand this correctly, you are basically trading one lock with 
> > another: Instead of waiting on the global storage lock, threads do now have 
> > to wait on the lock of the `workQueue`. Obviously waiting on that lock is 
> > much faster as there is less work being done in the critical region 
> > protected by that lock. However, isn't that a strategy that we could try to 
> > use as well? For details, see my comment in Part 3.

> a) Do you have some results of the JMH benchmarks that you might share?

Good question. Creating a reliable benchmark to simulate real-world lock 
contention is next to impossible. JMH guidelines warn that the accuracy of 
benchmarking degrades quickly with the number of threads involved. The best we 
could prove here is that the overhead introduced by the `BatchWorker` is not 
prohibitive:

Master:
```
Benchmark                                                                      Mode  Cnt     Score    Error  Units
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2714.177 ± 84.900  ops/s
```

Part 3 (https://reviews.apache.org/r/51765/):
```
Benchmark                                                                      Mode  Cnt     Score    Error  Units
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2664.389 ± 65.013  ops/s
```

The results are well within the margin of error, and both are a few orders of 
magnitude higher than the real scheduling throughput possible or observed in 
practice.

> b) If I understand this correctly, you are basically trading one lock for 
> another...

The major benefit here is improving the response time by reducing the hotspot 
contention. According to Little's Law [1]:
```
MeanResponseTime = MeanNumberInSystem / MeanThroughput
```
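
For illustration only (the numbers below are invented, not measured): if the 
native log commits 100 write transactions per second and 200 individual write 
requests are outstanding, each in its own transaction, the mean wait is 
200 / 100 = 2 s. Batching 20 requests per transaction leaves only 10 
outstanding transactions at the same commit rate, so the mean wait drops to 
10 / 100 = 0.1 s:

```
Unbatched: MeanResponseTime = 200 transactions / 100 tx/s      = 2.0 s
Batched:   MeanResponseTime =  10 batches      / 100 batches/s = 0.1 s
```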

Given the fixed throughput (native log perf), the only way to reduce queue 
saturation (and thus improve response time) is to reduce the number of 
outstanding requests. Batching multiple individual requests accomplishes 
exactly that goal. It's worth noting that the above formula applies when all 
requests take about the same time to process. That is *not* the case in our 
system: some take longer (task scheduling), others are very low latency 
(single task status processing). Varying the batch size for different types of 
requests (a few scheduling requests vs. hundreds of task events) lets us level 
out the request shape (as much as physically possible) and thus improve 
overall performance.

As for reducing the critical section weight, I'll address it in the Part 3 
comment. In general though, reducing the latency helps to level out the request 
shape but is still not sufficient to reduce the hotspot (see above).

[1] - https://en.wikipedia.org/wiki/Little%27s_law


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 58
> > 
> >
> > I am wondering if this queue shouldn't have a fixed capacity, so that 
> > we gain some kind of back pressure throughout the system.

Not sure I follow. The default `LinkedBlockingQueue` constructor uses 
`Integer.MAX_VALUE` for the max number of items in the queue. That should be 
more than sufficient for our use case.
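
For reference, a minimal sketch of the distinction (plain `java.util.concurrent` 
behavior, not Aurora code):

```
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueCapacityExample {
  public static void main(String[] args) throws InterruptedException {
    // Default constructor: capacity is Integer.MAX_VALUE, effectively
    // unbounded, so put() never blocks in practice (no back pressure).
    BlockingQueue<Runnable> unbounded = new LinkedBlockingQueue<>();

    // Bounded alternative: put() blocks once 1000 items are queued,
    // which is the back-pressure behavior Stephan is describing.
    BlockingQueue<Runnable> bounded = new LinkedBlockingQueue<>(1000);

    unbounded.put(() -> { });
    bounded.put(() -> { });
  }
}
```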


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 164
> > 
> >
> > `Work` is a subclass of `RepeatableWork` but meant to express something 
> > non-repeatable. This sounds like a violation of the [Liskov Substitution 
> > Principle](https://en.wikipedia.org/wiki/Liskov_substitution_principle). 
> > 
> > In other words, I feel like `RepeatableWork` should extend `Work` and 
> > not the other way round.

LSP is a principle applicable to mutable types; these are interfaces. 
`RepeatableWork` is a broader type that supports repetition, while `Work` is a 
convenience wrapper limited to exactly one repetition. We use a similar 
approach in `Storage`: see `StorageOperation` and the multitude of derived 
interfaces (`Quiet`, `NoResult`, etc.).
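
A hypothetical sketch of that relationship (names and signatures here are 
illustrative assumptions, not Aurora's exact API):

```
// Broader contract: a unit of work that may ask to be re-queued by
// reporting that it is not finished yet.
interface RepeatableWork {
  /** Returns true when finished, false to be scheduled for another run. */
  boolean run();
}

// Narrower convenience: a one-shot unit that always completes in one run.
@FunctionalInterface
interface Work extends RepeatableWork {
  void runOnce();

  @Override
  default boolean run() {
    runOnce();
    return true; // a Work never repeats
  }
}
```

In this shape, any `Work` can be passed wherever a `RepeatableWork` is 
expected, which matches the substitution direction the batch loop consumes.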


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 240-247
> > 
> >
> > A queue could hold vital information that we should not lose during 
> > scheduler failover (in particular, as we do this once per day).
> > 
> > Would it be possible to drain the queue completely before exiting here?

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-12 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148501
---



Along the lines of the question Stephan raised, what happens in the event of a 
failover mid-batch, especially w.r.t. repeatable work?


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 106 - 107)


I think this should be "does not initialize"?



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 257 - 259)


Even though there's only a single call site where we initialize a `WorkItem` 
in `FOLLOWUP` state, there's no guarantee that `lastBackoffMsec` will be 
present just because the state is `FOLLOWUP`.

We should add some state checks to the `WorkItem` constructor so that if the 
state is `FOLLOWUP` we'll throw if `lastBackoffMsec` is not present, to 
protect against future mistakes. (Or, if that's a valid scenario, we should 
perform an `isPresent` check or use `or` here...)
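
A minimal sketch of that guard (the `WorkItem` fields and states here are 
assumptions based on the comment, not the actual patch):

```
import static com.google.common.base.Preconditions.checkArgument;

import java.util.Optional;

final class WorkItem {
  enum State { EXECUTE, FOLLOWUP }

  private final State state;
  private final Optional<Long> lastBackoffMsec;

  WorkItem(State state, Optional<Long> lastBackoffMsec) {
    // Fail fast at construction time rather than at dispatch time,
    // protecting future call sites against the same mistake.
    if (state == State.FOLLOWUP) {
      checkArgument(
          lastBackoffMsec.isPresent(),
          "lastBackoffMsec must be present for FOLLOWUP items");
    }
    this.state = state;
    this.lastBackoffMsec = lastBackoffMsec;
  }
}
```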



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 273 - 275)


Should we do this after the batch is processed rather than before?


- Joshua Cohen


On Sept. 9, 2016, 5:29 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 9, 2016, 5:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and thus improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows: more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) degrades to the extent that certain 
> operations wait much longer (4x and more) for lock acquisition than it takes 
> to process their payload once inside the transaction. Some ops (like event 
> processing) are generally tolerant of such delays. Others are not. Task 
> scheduling suffers the most, as backing up the scheduling queue directly 
> affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction, thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items per batch is bounded. There are a few 
> `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling and task scheduling. Every instance can be 
> tuned independently (max batch size) and provides specialized metrics that 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. The lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as they did not perform well, or even performed much worse than, 
> the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-11 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148433
---



Thanks for the excellent write up and the time you took to split the review 
into sizable chunks. A full review will take some time. I will start with a 
couple of high-level questions.

a) Do you have some results of the JMH benchmarks that you might share?

b) If I understand this correctly, you are basically trading one lock for 
another: instead of waiting on the global storage lock, threads now have to 
wait on the lock of the `workQueue`. Obviously waiting on that lock is much 
faster as there is less work being done in the critical region protected by 
that lock. However, isn't that a strategy that we could try to use as well? For 
details, see my comment in Part 3.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 58)


I am wondering if this queue shouldn't have a fixed capacity, so that we 
gain some kind of back pressure throughout the system.



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 164)


`Work` is a subclass of `RepeatableWork` but meant to express something 
non-repeatable. This sounds like a violation of the [Liskov Substitution 
Principle](https://en.wikipedia.org/wiki/Liskov_substitution_principle). 

In other words, I feel like `RepeatableWork` should extend `Work` and not 
the other way round.



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 240 - 247)


A queue could hold vital information that we should not lose during 
scheduler failover (in particular, as we do this once per day).

Would it be possible to drain the queue completely before exiting here?
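
One possible shape of that drain, simplified to `Runnable` items (a sketch, 
not Aurora's actual shutdown path):

```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class QueueDrain {
  // Atomically moves whatever is left in the queue into a local list and
  // runs it as one final pass before the worker thread exits.
  static void drainAndRun(BlockingQueue<Runnable> workQueue) {
    List<Runnable> remaining = new ArrayList<>();
    workQueue.drainTo(remaining);
    remaining.forEach(Runnable::run);
  }
}
```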



src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 257 - 259)


1) Doesn't that effectively result in a busy loop where we are queuing the 
same item over and over until it can finally be executed? This sounds rather 
problematic.

2) This changes the order of work items. Are you sure that no storage 
client requires writes to be in order?


- Stephan Erb


On Sept. 9, 2016, 7:29 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 9, 2016, 7:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and thus improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows: more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) degrades to the extent that certain 
> operations wait much longer (4x and more) for lock acquisition than it takes 
> to process their payload once inside the transaction. Some ops (like event 
> processing) are generally tolerant of such delays. Others are not. Task 
> scheduling suffers the most, as backing up the scheduling queue directly 
> affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction, thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items per batch is bounded. There are a few 
> `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron 

Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-09 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148349
---


Ship it!




Master (c7f710a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 9, 2016, 5:29 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 9, 2016, 5:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (of 3) patches intended to reduce storage write lock 
> contention and thus improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows: more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, and so on. Eventually, the scheduler throughput 
> (and especially task scheduling) degrades to the extent that certain 
> operations wait much longer (4x and more) for lock acquisition than it takes 
> to process their payload once inside the transaction. Some ops (like event 
> processing) are generally tolerant of such delays. Others are not. Task 
> scheduling suffers the most, as backing up the scheduling queue directly 
> affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction, thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items per batch is bounded. There are a few 
> `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling and task scheduling. Every instance can be 
> tuned independently (max batch size) and provides specialized metrics that 
> help monitor each workload type's performance.
> 
> #Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. The lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as they did not perform well, or even performed much worse than, 
> the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling perf improvement: taking more than one task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't be possible without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the current 
> patch but will become obvious in part 2 (coming out shortly).  
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>