Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150035
---


Ship it!





src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 195)


For posterity,

This is important because the Guava docs says that `run()` should respond 
to shutdown requests by checking `isRunning()`. If we block indefinately we 
technically don't respond to shutdown requests.


- Zameer Manji


On Sept. 22, 2016, 8:20 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 8:20 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen


> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?
> 
> Maxim Khutornenko wrote:
> The pre-BatchWorker behavior depended on where/how the error was thrown 
> and could have resulted in a hard shutdown or a mysterious silent async 
> failure killing one of the executor threads. With the BatchWorker, the 
> scheduler will always fail on unhandled exception. This is needed to preserve 
> the transaction boundaries and let the error rollback the changes.
> 
> > ...but should we also determine the root cause that led to the failure 
> to begin with  
> 
> I thought it should have been clear from the RB description but then I 
> realized I explained it in Slack and not here. We purposefully induce 
> scheduler failovers many times during the e2e run to verify things like job 
> updates still work. If the shutdown happens during the transaction, the DB 
> gets closed underneath and the subsequent queries fail any way they like. 
> This is exactly what happened in this case. A task is saved, driver exits, 
> StateManager attempts to save task events but fails due to "DB closed" error. 
> That error is caught by the BatchWorker and the native log transaction 
> successfully completes before the scheduler fully terminates.

Thanks, I missed that in slack!


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150026
---


Ship it!




Ship It!

- Joshua Cohen


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150020
---


Ship it!




Master (d3c5ca7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/
---

(Updated Sept. 22, 2016, 3:20 p.m.)


Review request for Aurora, Stephan Erb and Zameer Manji.


Changes
---

Removing redundant shutdown hook.


Bugs: AURORA-1779
https://issues.apache.org/jira/browse/AURORA-1779


Repository: aurora


Description
---

##Problem

The current `BatchWorker` error handling assumes graceful recovery if any of 
the individual batch items fail. This may not hold true if the storage is 
modified _before_ the error is thrown and the `LogStorage` transaction is not 
rolled back. Consider the following fragment from the `StateManagerImpl`:
```
storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);

for (IScheduledTask scheduledTask : scheduledTasks) {
  updateTaskAndExternalState(
  storeProvider.getUnsafeTaskStore(),
  Tasks.id(scheduledTask),
  Optional.of(scheduledTask),
  Optional.of(PENDING),
  Optional.absent());
}
```
If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND 
`BatchWorker` catches and logs that exception, the native store transaction 
will be committed and as a result the storage will be left in a logically 
inconsistent state.

##Solution

I can see at least 3 ways to solve this problem:
1. Fail hard and shutdown scheduler when any of the batch items throw;
2. Fail only the last batch (drop all its items) and let storage transaction 
roll back;
3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
failed item would be retried when the storage transaction rolls back;

The #3 is the most involved and would be very hard to get right (assuming batch 
items are idempotent, which may not be the case). The #2 may result in a very 
hard to troubleshoot scenario where _some_ items would be dropped on the floor 
and never completed. 

I suggest we go with #1 as the most straightforward and transparent approach. 
It's also the only one that ensures storage state consistency, which is the 
most vital invariant to preserve (as AURORA-1779 proves). The only downside of 
this approach is that scheduler will go down hard any time an unhandled error 
is thrown but arguably this is the easiest way to improve our codebase and 
certainly better than leaving the scheduler in a crippled state.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
e05d4b48749b0691902a505bea1b4490fdd08f29 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
a86dc82cafa7f5436d2b8d00c6db575ff8311eea 

Diff: https://reviews.apache.org/r/52141/diff/


Testing
---

unit and e2e tests


Thanks,

Maxim Khutornenko



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko


> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?

The pre-BatchWorker behavior depended on where/how the error was thrown and 
could have resulted in a hard shutdown or a mysterious silent async failure 
killing one of the executor threads. With the BatchWorker, the scheduler will 
always fail on unhandled exception. This is needed to preserve the transaction 
boundaries and let the error rollback the changes.

> ...but should we also determine the root cause that led to the failure to 
> begin with  

I thought it should have been clear from the RB description but then I realized 
I explained it in Slack and not here. We purposefully induce scheduler 
failovers many times during the e2e run to verify things like job updates still 
work. If the shutdown happens during the transaction, the DB gets closed 
underneath and the subsequent queries fail any way they like. This is exactly 
what happened in this case. A task is saved, driver exits, StateManager 
attempts to save task events but fails due to "DB closed" error. That error is 
caught by the BatchWorker and the native log transaction successfully completes 
before the scheduler fully terminates.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82ca

Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Maxim Khutornenko


> On Sept. 22, 2016, 2:25 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 152
> > 
> >
> > Something is wrong here.
> > 
> > The listener from `GuavaUtils.LifecycleShutdownListener` should have 
> > already been applied to this service in `SchedulerServicesModule`. We used 
> > `addSchedulerActiveServiceBinding` to add the BatchWorker services so I'm 
> > not sure why this helps at all.

Ah, that's right, that should be enough. So, the fix is even simpler, just let 
the error bubble up and the existing shutdown hook will take care of the rest. 
Verified in vagrant that's exactly what happens.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review15
---


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review15
---




src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 152)


Something is wrong here.

The listener from `GuavaUtils.LifecycleShutdownListener` should have 
already been applied to this service in `SchedulerServicesModule`. We used 
`addSchedulerActiveServiceBinding` to add the BatchWorker services so I'm not 
sure why this helps at all.


- Zameer Manji


On Sept. 21, 2016, 10:38 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 21, 2016, 10:38 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review14
---



What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
there something inherent to the batch worker that makes this necessary?

Also, this fixes the error handling that led to corruption in the event of a 
failure, but should we also determine the root cause that led to the failure to 
begin with (as part of another review no doubt)?

- Joshua Cohen


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-22 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149963
---


Ship it!




Ship It!

- Stephan Erb


On Sept. 22, 2016, 7:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 7:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-21 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149954
---


Ship it!




Master (d3c5ca7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-21 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/
---

Review request for Aurora, Stephan Erb and Zameer Manji.


Bugs: AURORA-1779
https://issues.apache.org/jira/browse/AURORA-1779


Repository: aurora


Description
---

##Problem

The current `BatchWorker` error handling assumes graceful recovery if any of 
the individual batch items fail. This may not hold true if the storage is 
modified _before_ the error is thrown and the `LogStorage` transaction is not 
rolled back. Consider the following fragment from the `StateManagerImpl`:
```
storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);

for (IScheduledTask scheduledTask : scheduledTasks) {
  updateTaskAndExternalState(
  storeProvider.getUnsafeTaskStore(),
  Tasks.id(scheduledTask),
  Optional.of(scheduledTask),
  Optional.of(PENDING),
  Optional.absent());
}
```
If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND 
`BatchWorker` catches and logs that exception, the native store transaction 
will be committed and as a result the storage will be left in a logically 
inconsistent state.

##Solution

I can see at least 3 ways to solve this problem:
1. Fail hard and shutdown scheduler when any of the batch items throw;
2. Fail only the last batch (drop all its items) and let storage transaction 
roll back;
3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
failed item would be retried when the storage transaction rolls back;

The #3 is the most involved and would be very hard to get right (assuming batch 
items are idempotent, which may not be the case). The #2 may result in a very 
hard to troubleshoot scenario where _some_ items would be dropped on the floor 
and never completed. 

I suggest we go with #1 as the most straightforward and transparent approach. 
It's also the only one that ensures storage state consistency, which is the 
most vital invariant to preserve (as AURORA-1779 proves). The only downside of 
this approach is that scheduler will go down hard any time an unhandled error 
is thrown but arguably this is the easiest way to improve our codebase and 
certainly better than leaving the scheduler in a crippled state.


Diffs
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
e05d4b48749b0691902a505bea1b4490fdd08f29 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
2ec3967ddb1d470cf681de062a6400f647978185 
  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
7c8047a7235a937c29fe96517242923ff84a080c 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
d390c07522d22e43d79ce4370985f3643ef021ca 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
8556253fc11f6027316871eb9dc66d8627a77fe6 

Diff: https://reviews.apache.org/r/52141/diff/


Testing
---

unit and e2e tests


Thanks,

Maxim Khutornenko