Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Joshua Cohen


> On Dec. 2, 2016, 3:58 p.m., Joshua Cohen wrote:
> > src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java,
> >  lines 215-218
> > 
> >
> > Should we have an escape hatch for the case where we never become 
> > leader in this test (i.e. sleep for up to N seconds then `fail()`)?
> 
> Zameer Manji wrote:
> Perhaps I could just add a timeout to the test? JUnit allows me to do 
> that.

Timeout on the test works for me.


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157759
---


On Dec. 2, 2016, 3:19 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 2, 2016, 3:19 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Karthik Anantha Padmanabhan


> On Dec. 2, 2016, 10:37 p.m., Zameer Manji wrote:
> > Can someone else validate that the tests pass for them locally? I can't 
> > reproduce the jenkins failure.

I wonder if this is a timing issue in the tests ?

`expectGroupEvent(PathChildrenCacheEvent.Type.CHILD_ADDED)` returns as soon as 
there is a CHILD_ADDED event. Is it possible `listener.onLeading` isn't called 
by then ?

assertTrue(capture.hasCaptured())


- Karthik


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157847
---


On Dec. 2, 2016, 3:19 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 2, 2016, 3:19 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157847
---



Can someone else validate that the tests pass for them locally? I can't 
reproduce the jenkins failure.

- Zameer Manji


On Dec. 1, 2016, 7:19 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 1, 2016, 7:19 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157844
---



Master (4bc5246) is red with this patch.
  ./build-support/jenkins/build.sh

at 
org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment.doBuild(EstablishBuildEnvironment.java:72)
at 
org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:36)
at 
org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
at 
org.gradle.launcher.daemon.server.exec.HintGCAfterBuild.execute(HintGCAfterBuild.java:44)
at 
org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
at 
org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy$1.run(StartBuildOrRespondWithBusy.java:50)
at 
org.gradle.launcher.daemon.server.DaemonStateCoordinator$1.run(DaemonStateCoordinator.java:293)
at 
org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
at 
org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
:pmdTest
:test

org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > 
testLeadAdvertise FAILED
java.lang.AssertionError at CuratorSingletonServiceTest.java:84
java.lang.AssertionError
I1202 22:25:10.665 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1068 tests completed, 1 failed, 2 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:analyzeReport
Instruction coverage is 0.8879984514130855, but must be greater than 0.89
Branch coverage is 0.8017550857598723, but must be greater than 0.835

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

BUILD FAILED

Total time: 3 mins 59.199 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 2, 2016, 3:19 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 2, 2016, 3:19 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157840
---



@ReviewBot retry

- Zameer Manji


On Dec. 1, 2016, 7:19 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 1, 2016, 7:19 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54269: Improve scheduling throughput via logging changes.

2016-12-02 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54269/#review157838
---



Master (3ea0331) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 2, 2016, 9:51 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54269/
> ---
> 
> (Updated Dec. 2, 2016, 9:51 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Stephan Erb.
> 
> 
> Bugs: AURORA-1831
> https://issues.apache.org/jira/browse/AURORA-1831
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch makes two logging performance changes.
> 
> First, it reduces the cost of logging by replacing the costly class and line
> patterns with the cheaper logger pattern. We lose line numbers and inner class
> information for much cheaper logging.
> 
> Before
> 
> I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] 
> SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> 
> ACTIVE
> 
> 
> After
> 
> I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state 
> machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE
> 
> 
> Second, it reduces the verbosity of the `TaskStateMachine` logging where it 
> logs
> the work command from the transition. I don't think there is any operator 
> value
> in logging this (unlike the task state transitions) so I have lowered it to
> debug.
> 
> Performance Before:
> 
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt  Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  2.510 ± 0.557  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10  0.272 ± 0.030  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10  0.053 ± 0.011  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt   10  2.446 ± 0.698  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt   10  0.246 ± 0.018  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt   10  0.041 ± 0.006  ops/s
> 
> 
> Performance After:
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt   Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  14.520 ± 5.696  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10   1.290 ± 0.361  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10   0.254 ± 0.097  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt5   7.303 ± 5.662  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt5   0.726 ± 0.624  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt5   0.124 ± 0.058  ops/s
> 
> 
> There is a performance improvement in the smaller case and no noticable
> degredation in the larger cases. I also verified on a small cluster that the
> improvements exist for the larger cases. I am unable to reduce the error bars
> locally on the `InsertPendingTasksBenchmark`. I suspect the bencmark needs to 
> be
> tweaked to be more consistent.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md 7a3d33106c08a1cf690b3fce922000433d61bc62 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> 23f256a7d467c79dcd5d906f76af4f0261dfd81d 
>   src/main/resources/logback.xml 84c175cec811fd95db44fd8dfcd514385606042d 
> 
> Diff: https://reviews.apache.org/r/54269/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54269: Improve scheduling throughput via logging changes.

2016-12-02 Thread Zameer Manji


> On Dec. 1, 2016, 5:41 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java, line 
> > 474
> > 
> >
> > Please use the logging builtin formatting rather than  doing the string 
> > concatenate all the time.

Good catch, done!


On Dec. 1, 2016, 5:41 p.m., Zameer Manji wrote:
> > Please mention the logger config change in the release notes.

Done.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54269/#review157687
---


On Dec. 1, 2016, 3:13 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54269/
> ---
> 
> (Updated Dec. 1, 2016, 3:13 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Stephan Erb.
> 
> 
> Bugs: AURORA-1831
> https://issues.apache.org/jira/browse/AURORA-1831
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch makes two logging performance changes.
> 
> First, it reduces the cost of logging by replacing the costly class and line
> patterns with the cheaper logger pattern. We lose line numbers and inner class
> information for much cheaper logging.
> 
> Before
> 
> I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] 
> SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> 
> ACTIVE
> 
> 
> After
> 
> I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state 
> machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE
> 
> 
> Second, it reduces the verbosity of the `TaskStateMachine` logging where it 
> logs
> the work command from the transition. I don't think there is any operator 
> value
> in logging this (unlike the task state transitions) so I have lowered it to
> debug.
> 
> Performance Before:
> 
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt  Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  2.510 ± 0.557  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10  0.272 ± 0.030  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10  0.053 ± 0.011  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt   10  2.446 ± 0.698  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt   10  0.246 ± 0.018  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt   10  0.041 ± 0.006  ops/s
> 
> 
> Performance After:
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt   Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  14.520 ± 5.696  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10   1.290 ± 0.361  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10   0.254 ± 0.097  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt5   7.303 ± 5.662  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt5   0.726 ± 0.624  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt5   0.124 ± 0.058  ops/s
> 
> 
> There is a performance improvement in the smaller case and no noticable
> degredation in the larger cases. I also verified on a small cluster that the
> improvements exist for the larger cases. I am unable to reduce the error bars
> locally on the `InsertPendingTasksBenchmark`. I suspect the bencmark needs to 
> be
> tweaked to be more consistent.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> 23f256a7d467c79dcd5d906f76af4f0261dfd81d 
>   src/main/resources/logback.xml 84c175cec811fd95db44fd8dfcd514385606042d 
> 
> Diff: https://reviews.apache.org/r/54269/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54269: Improve scheduling throughput via logging changes.

2016-12-02 Thread Zameer Manji


> On Dec. 2, 2016, 11:29 a.m., Joshua Cohen wrote:
> > src/main/resources/logback.xml, line 26
> > 
> >
> > Can you add a comment here explaining why we're not using `%class` and 
> > `%line`? They're definitely valuable when debugging, so I can imagine 
> > someone coming back in the future to change this.

Done.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54269/#review157799
---


On Dec. 1, 2016, 3:13 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54269/
> ---
> 
> (Updated Dec. 1, 2016, 3:13 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Stephan Erb.
> 
> 
> Bugs: AURORA-1831
> https://issues.apache.org/jira/browse/AURORA-1831
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch makes two logging performance changes.
> 
> First, it reduces the cost of logging by replacing the costly class and line
> patterns with the cheaper logger pattern. We lose line numbers and inner class
> information for much cheaper logging.
> 
> Before
> 
> I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] 
> SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> 
> ACTIVE
> 
> 
> After
> 
> I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state 
> machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE
> 
> 
> Second, it reduces the verbosity of the `TaskStateMachine` logging where it 
> logs
> the work command from the transition. I don't think there is any operator 
> value
> in logging this (unlike the task state transitions) so I have lowered it to
> debug.
> 
> Performance Before:
> 
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt  Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  2.510 ± 0.557  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10  0.272 ± 0.030  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10  0.053 ± 0.011  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt   10  2.446 ± 0.698  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt   10  0.246 ± 0.018  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt   10  0.041 ± 0.006  ops/s
> 
> 
> Performance After:
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt   Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  14.520 ± 5.696  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10   1.290 ± 0.361  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10   0.254 ± 0.097  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt5   7.303 ± 5.662  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt5   0.726 ± 0.624  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt5   0.124 ± 0.058  ops/s
> 
> 
> There is a performance improvement in the smaller case and no noticable
> degredation in the larger cases. I also verified on a small cluster that the
> improvements exist for the larger cases. I am unable to reduce the error bars
> locally on the `InsertPendingTasksBenchmark`. I suspect the bencmark needs to 
> be
> tweaked to be more consistent.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> 23f256a7d467c79dcd5d906f76af4f0261dfd81d 
>   src/main/resources/logback.xml 84c175cec811fd95db44fd8dfcd514385606042d 
> 
> Diff: https://reviews.apache.org/r/54269/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54269: Improve scheduling throughput via logging changes.

2016-12-02 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54269/
---

(Updated Dec. 2, 2016, 1:51 p.m.)


Review request for Aurora, David McLaughlin, Joshua Cohen, and Stephan Erb.


Bugs: AURORA-1831
https://issues.apache.org/jira/browse/AURORA-1831


Repository: aurora


Description
---

This patch makes two logging performance changes.

First, it reduces the cost of logging by replacing the costly class and line
patterns with the cheaper logger pattern. We lose line numbers and inner class
information for much cheaper logging.

Before

I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] 
SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> 
ACTIVE


After

I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state 
machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE


Second, it reduces the verbosity of the `TaskStateMachine` logging where it logs
the work command from the transition. I don't think there is any operator value
in logging this (unlike the task state transitions) so I have lowered it to
debug.

Performance Before:


Benchmark   (numPendingTasks)  
(numTasksToDelete)   Mode  Cnt  Score   Error  Units
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
 1000  thrpt   10  2.510 ± 0.557  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
1  thrpt   10  0.272 ± 0.030  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
5  thrpt   10  0.053 ± 0.011  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000   
  N/A  thrpt   10  2.446 ± 0.698  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1   
  N/A  thrpt   10  0.246 ± 0.018  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5   
  N/A  thrpt   10  0.041 ± 0.006  ops/s


Performance After:

Benchmark   (numPendingTasks)  
(numTasksToDelete)   Mode  Cnt   Score   Error  Units
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
 1000  thrpt   10  14.520 ± 5.696  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
1  thrpt   10   1.290 ± 0.361  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A   
5  thrpt   10   0.254 ± 0.097  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000   
  N/A  thrpt5   7.303 ± 5.662  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1   
  N/A  thrpt5   0.726 ± 0.624  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5   
  N/A  thrpt5   0.124 ± 0.058  ops/s


There is a performance improvement in the smaller case and no noticable
degredation in the larger cases. I also verified on a small cluster that the
improvements exist for the larger cases. I am unable to reduce the error bars
locally on the `InsertPendingTasksBenchmark`. I suspect the bencmark needs to be
tweaked to be more consistent.


Diffs (updated)
-

  RELEASE-NOTES.md 7a3d33106c08a1cf690b3fce922000433d61bc62 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
23f256a7d467c79dcd5d906f76af4f0261dfd81d 
  src/main/resources/logback.xml 84c175cec811fd95db44fd8dfcd514385606042d 

Diff: https://reviews.apache.org/r/54269/diff/


Testing
---


Thanks,

Zameer Manji



Re: Review Request 54299: Extend warm-up time by `max_consecutive_failures` attempts.

2016-12-02 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54299/#review157764
---




src/main/python/apache/aurora/executor/common/health_checker.py (line 113)


s/suppose/supposed



src/main/python/apache/aurora/executor/common/health_checker.py (lines 115 - 
117)


There still exists the chance for a backwards incompatibility here. Under 
the previous watch-driven updates, a task could flip between failing and 
successful health checks, and as long as it's still running at the end of 
`watch_secs` the updater would consider it healthy and move on. With this new 
behavior, someone could configure a task in such a way that the max attempts 
are consumed without reaching `max_consecutive_failures` or 
`min_consecutive_successes` before `watch_secs` is elapsed, meaning that the 
task would fail.

As we discussed earlier, if we make `watch_secs` and 
`min_consecutive_successes` mutually exclusive in the client, then the executor 
could only trigger the new behavior if the user opted in by setting 
`watch_secs` to 0 and `min_consecutive_successes` to non-zero.


- Joshua Cohen


On Dec. 2, 2016, 8:43 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54299/
> ---
> 
> (Updated Dec. 2, 2016, 8:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, Stephan Erb, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1841
> https://issues.apache.org/jira/browse/AURORA-1841
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> It is possible to set the health checks such that a task can
> continually fail health checks with intermittent successes and still
> succeed an update. Essentially a task fails health checks during the
> `initial_interval_secs` and an additional `max_consecutive_failures`,
> and then perform a successful health check to become healthy.
> 
> To be backward compatible to the above configuration, include the
> `max_consecutive_failures` when computing `max_attempts_to_running`.
> 
> 
> Diffs
> -
> 
>   docs/features/services.md 50189eeff26ce9614d092f6abd9246788647fe2b 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 12af9d8635a553eabe918a86508aa6ce2fd78a49 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> e2a7f164a24f49dd1f4cdba136e838b9d42d73a2 
> 
> Diff: https://reviews.apache.org/r/54299/diff/
> 
> 
> Testing
> ---
> 
> build-support/jenkins/build.sh
> src/test/sh/org/apacher/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 54299: Extend warm-up time by `max_consecutive_failures` attempts.

2016-12-02 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54299/#review157807
---


Ship it!




It took me a long time to understand this after staring at the tests, but I 
think this is correct.

This is unfortunately a little complex to understand. For bonus points, would 
it be possible to encode some of this information in a diagram?

The tests are thourough, which makes me comfortable in shipping this change.

- Zameer Manji


On Dec. 2, 2016, 12:43 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54299/
> ---
> 
> (Updated Dec. 2, 2016, 12:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, Stephan Erb, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1841
> https://issues.apache.org/jira/browse/AURORA-1841
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> It is possible to set the health checks such that a task can
> continually fail health checks with intermittent successes and still
> succeed an update. Essentially a task fails health checks during the
> `initial_interval_secs` and an additional `max_consecutive_failures`,
> and then perform a successful health check to become healthy.
> 
> To be backward compatible to the above configuration, include the
> `max_consecutive_failures` when computing `max_attempts_to_running`.
> 
> 
> Diffs
> -
> 
>   docs/features/services.md 50189eeff26ce9614d092f6abd9246788647fe2b 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 12af9d8635a553eabe918a86508aa6ce2fd78a49 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> e2a7f164a24f49dd1f4cdba136e838b9d42d73a2 
> 
> Diff: https://reviews.apache.org/r/54299/diff/
> 
> 
> Testing
> ---
> 
> build-support/jenkins/build.sh
> src/test/sh/org/apacher/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 54269: Improve scheduling throughput via logging changes.

2016-12-02 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54269/#review157799
---


Ship it!





src/main/resources/logback.xml (line 26)


Can you add a comment here explaining why we're not using `%class` and 
`%line`? They're definitely valuable when debugging, so I can imagine someone 
coming back in the future to change this.


- Joshua Cohen


On Dec. 1, 2016, 11:13 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54269/
> ---
> 
> (Updated Dec. 1, 2016, 11:13 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Stephan Erb.
> 
> 
> Bugs: AURORA-1831
> https://issues.apache.org/jira/browse/AURORA-1831
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch makes two logging performance changes.
> 
> First, it reduces the cost of logging by replacing the costly class and line
> patterns with the cheaper logger pattern. We lose line numbers and inner class
> information for much cheaper logging.
> 
> Before
> 
> I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] 
> SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> 
> ACTIVE
> 
> 
> After
> 
> I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state 
> machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE
> 
> 
> Second, it reduces the verbosity of the `TaskStateMachine` logging where it 
> logs
> the work command from the transition. I don't think there is any operator 
> value
> in logging this (unlike the task state transitions) so I have lowered it to
> debug.
> 
> Performance Before:
> 
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt  Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  2.510 ± 0.557  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10  0.272 ± 0.030  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10  0.053 ± 0.011  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt   10  2.446 ± 0.698  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt   10  0.246 ± 0.018  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt   10  0.041 ± 0.006  ops/s
> 
> 
> Performance After:
> 
> Benchmark   (numPendingTasks)  
> (numTasksToDelete)   Mode  Cnt   Score   Error  Units
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>1000  thrpt   10  14.520 ± 5.696  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   1  thrpt   10   1.290 ± 0.361  ops/s
> StateManagerBenchmarks.DeleteTasksBenchmark.run   N/A 
>   5  thrpt   10   0.254 ± 0.097  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run   1000 
> N/A  thrpt5   7.303 ± 5.662  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  1 
> N/A  thrpt5   0.726 ± 0.624  ops/s
> StateManagerBenchmarks.InsertPendingTasksBenchmark.run  5 
> N/A  thrpt5   0.124 ± 0.058  ops/s
> 
> 
> There is a performance improvement in the smaller case and no noticable
> degredation in the larger cases. I also verified on a small cluster that the
> improvements exist for the larger cases. I am unable to reduce the error bars
> locally on the `InsertPendingTasksBenchmark`. I suspect the bencmark needs to 
> be
> tweaked to be more consistent.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> 23f256a7d467c79dcd5d906f76af4f0261dfd81d 
>   src/main/resources/logback.xml 84c175cec811fd95db44fd8dfcd514385606042d 
> 
> Diff: https://reviews.apache.org/r/54269/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-02 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review157759
---



Thanks for picking this up! This is a basic question, but I just want to be 
sure: by mimicking the old behavior, we're not running the risk of 
re-introducing the same deadlock we were trying to fix by moving to Curator, 
right? I'm not sure where the deadlock was caused... was it in our 
implementation of the `SingletonService` recipe, was it in the ZK client 
itself, or somewhere else entirely?


src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
 (lines 215 - 218)


Should we have an escape hatch for the case where we never become leader in 
this test (i.e. sleep for up to N seconds then `fail()`)?


- Joshua Cohen


On Dec. 2, 2016, 3:19 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 2, 2016, 3:19 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54299: Extend warm-up time by `max_consecutive_failures` attempts.

2016-12-02 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54299/#review157722
---


Ship it!




Master (3ea0331) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 2, 2016, 8:43 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54299/
> ---
> 
> (Updated Dec. 2, 2016, 8:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, Stephan Erb, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1841
> https://issues.apache.org/jira/browse/AURORA-1841
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> It is possible to set the health checks such that a task can
> continually fail health checks with intermittent successes and still
> succeed an update. Essentially a task fails health checks during the
> `initial_interval_secs` and an additional `max_consecutive_failures`,
> and then perform a successful health check to become healthy.
> 
> To be backward compatible to the above configuration, include the
> `max_consecutive_failures` when computing `max_attempts_to_running`.
> 
> 
> Diffs
> -
> 
>   docs/features/services.md 50189eeff26ce9614d092f6abd9246788647fe2b 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 12af9d8635a553eabe918a86508aa6ce2fd78a49 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> e2a7f164a24f49dd1f4cdba136e838b9d42d73a2 
> 
> Diff: https://reviews.apache.org/r/54299/diff/
> 
> 
> Testing
> ---
> 
> build-support/jenkins/build.sh
> src/test/sh/org/apacher/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Review Request 54299: Extend warm-up time by `max_consecutive_failures` attempts.

2016-12-02 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54299/
---

Review request for Aurora, David McLaughlin, Joshua Cohen, Stephan Erb, and 
Zameer Manji.


Bugs: AURORA-1841
https://issues.apache.org/jira/browse/AURORA-1841


Repository: aurora


Description
---

It is possible to set the health checks such that a task can
continually fail health checks with intermittent successes and still
succeed an update. Essentially a task fails health checks during the
`initial_interval_secs` and an additional `max_consecutive_failures`,
and then perform a successful health check to become healthy.

To be backward compatible to the above configuration, include the
`max_consecutive_failures` when computing `max_attempts_to_running`.


Diffs
-

  docs/features/services.md 50189eeff26ce9614d092f6abd9246788647fe2b 
  src/main/python/apache/aurora/executor/common/health_checker.py 
12af9d8635a553eabe918a86508aa6ce2fd78a49 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
e2a7f164a24f49dd1f4cdba136e838b9d42d73a2 

Diff: https://reviews.apache.org/r/54299/diff/


Testing
---

build-support/jenkins/build.sh
src/test/sh/org/apacher/aurora/e2e/test_end_to_end.sh


Thanks,

Santhosh Kumar Shanmugham