Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-11 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review158815
---



Master (c8e8953) is red with this patch.
  ./build-support/jenkins/build.sh

at 
org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
at 
org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment.doBuild(EstablishBuildEnvironment.java:72)
at 
org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:36)
at 
org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
at 
org.gradle.launcher.daemon.server.exec.HintGCAfterBuild.execute(HintGCAfterBuild.java:44)
at 
org.gradle.launcher.daemon.server.api.DaemonCommandExecution.proceed(DaemonCommandExecution.java:120)
at 
org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy$1.run(StartBuildOrRespondWithBusy.java:50)
at 
org.gradle.launcher.daemon.server.DaemonStateCoordinator$1.run(DaemonStateCoordinator.java:293)
at 
org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
at 
org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
:pmdTest
:test

org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > 
testZKDisconnection FAILED
java.lang.AssertionError at CuratorSingletonServiceTest.java:233
I1212 01:48:33.031 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1068 tests completed, 1 failed, 2 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:analyzeReport
Instruction coverage is 0.8877981722428748, but must be greater than 0.89
Branch coverage is 0.8013562026326286, but must be greater than 0.835

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

BUILD FAILED

Total time: 3 mins 46.037 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 12, 2016, 12:43 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 12, 2016, 12:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54624: Expose stats on ZooKeeper connection state

2016-12-11 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54624/#review158814
---



Master (c8e8953) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 12, 2016, 1:32 a.m., Jing Chen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54624/
> ---
> 
> (Updated Dec. 12, 2016, 1:32 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Mehrdad Nurolahzade, Stephan Erb, 
> and Zameer Manji.
> 
> 
> Bugs: AURORA-1838
> https://issues.apache.org/jira/browse/AURORA-1838
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Expose stats on ZooKeeper connection state
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceDiscoveryModule.java
>  999a542796858dcfe9e31601c47239189043fd52 
> 
> Diff: https://reviews.apache.org/r/54624/diff/
> 
> 
> Testing
> ---
> 
> http://192.168.33.7:8081/vars
> ```
> zk_connection_state_connected 1
> zk_connection_state_lost 0
> zk_connection_state_readonly 0
> zk_connection_state_reconnected 0
> zk_connection_state_suspended 0
> zk_connection_state_unknown 0
> ```
> 
> 
> Thanks,
> 
> Jing Chen
> 
>



Re: Review Request 54624: Expose stats on ZooKeeper connection state

2016-12-11 Thread Jing Chen


> On Dec. 10, 2016, 8:14 p.m., Joshua Cohen wrote:
> > I think the idea behind the ticket was to track these as stats, not just 
> > into the log.

btw, I am not pretty sure if `AtomicLong`s here should be `static` or not


- Jing


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54624/#review158773
---


On Dec. 12, 2016, 1:32 a.m., Jing Chen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54624/
> ---
> 
> (Updated Dec. 12, 2016, 1:32 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Mehrdad Nurolahzade, Stephan Erb, 
> and Zameer Manji.
> 
> 
> Bugs: AURORA-1838
> https://issues.apache.org/jira/browse/AURORA-1838
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Expose stats on ZooKeeper connection state
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceDiscoveryModule.java
>  999a542796858dcfe9e31601c47239189043fd52 
> 
> Diff: https://reviews.apache.org/r/54624/diff/
> 
> 
> Testing
> ---
> 
> http://192.168.33.7:8081/vars
> ```
> zk_connection_state_connected 1
> zk_connection_state_lost 0
> zk_connection_state_readonly 0
> zk_connection_state_reconnected 0
> zk_connection_state_suspended 0
> zk_connection_state_unknown 0
> ```
> 
> 
> Thanks,
> 
> Jing Chen
> 
>



Re: Review Request 54624: Expose stats on ZooKeeper connection state

2016-12-11 Thread Jing Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54624/
---

(Updated Dec. 12, 2016, 1:32 a.m.)


Review request for Aurora, Joshua Cohen, Mehrdad Nurolahzade, Stephan Erb, and 
Zameer Manji.


Bugs: AURORA-1838
https://issues.apache.org/jira/browse/AURORA-1838


Repository: aurora


Description
---

Expose stats on ZooKeeper connection state


Diffs
-

  
src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceDiscoveryModule.java
 999a542796858dcfe9e31601c47239189043fd52 

Diff: https://reviews.apache.org/r/54624/diff/


Testing (updated)
---

http://192.168.33.7:8081/vars
```
zk_connection_state_connected 1
zk_connection_state_lost 0
zk_connection_state_readonly 0
zk_connection_state_reconnected 0
zk_connection_state_suspended 0
zk_connection_state_unknown 0
```


Thanks,

Jing Chen



Re: Review Request 54624: Expose stats on ZooKeeper connection state

2016-12-11 Thread Jing Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54624/
---

(Updated Dec. 12, 2016, 1:28 a.m.)


Review request for Aurora, Mehrdad Nurolahzade, Stephan Erb, and Zameer Manji.


Changes
---

expose stats on connection state changes


Bugs: AURORA-1838
https://issues.apache.org/jira/browse/AURORA-1838


Repository: aurora


Description
---

Expose stats on ZooKeeper connection state


Diffs (updated)
-

  
src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceDiscoveryModule.java
 999a542796858dcfe9e31601c47239189043fd52 

Diff: https://reviews.apache.org/r/54624/diff/


Testing
---

State Change is logged into aurora-scheduler.log:
```
I1210 10:44:37.973 [main-EventThread, ConnectionStateManager] State change: 
CONNECTED
```


Thanks,

Jing Chen



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-11 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review158812
---



@ReviewBot retry

- Zameer Manji


On Dec. 11, 2016, 4:43 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 11, 2016, 4:43 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-11 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review158811
---


Ship it!




Master (c8e8953) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 12, 2016, 12:43 a.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 12, 2016, 12:43 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-11 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/
---

(Updated Dec. 11, 2016, 4:43 p.m.)


Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
Stephan Erb.


Changes
---

Fix more possible timing issues.


Bugs: AURORA-1669
https://issues.apache.org/jira/browse/AURORA-1669


Repository: aurora


Description
---

As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
leadership if the ZK connection is lost or if there is a timeout. This is not
compatible with the commons based implementation which would only abdicate
leadership if the ZK session timeout occurred.

This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
custom listener that only loses leadership if a connection loss occurs.


Diffs (updated)
-

  
commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
 50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
  
src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
 c378172c850aafe0a9381552b5067277b40dbfab 
  
src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
 a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
  
src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
 6ea49b0c690d288ff59d1d4798144bfa2d153d3a 

Diff: https://reviews.apache.org/r/54288/diff/


Testing
---


Thanks,

Zameer Manji



Re: Review Request 54288: Make leader elections resilient to ZK disconnections.

2016-12-11 Thread Zameer Manji


> On Dec. 8, 2016, 3:42 p.m., Zameer Manji wrote:
> > Does anyone know how to get the test reports from jenkins or have an idea 
> > of what's going on?
> 
> John Sirois wrote:
> Yes, these are legit failures, no Jenkins logs needed, just this is 
> enough:
> ```
> org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > 
> testAbdicateTransition FAILED
> java.lang.AssertionError at CuratorSingletonServiceTest.java:125
> java.lang.AssertionError
> 
> org.apache.aurora.scheduler.discovery.CuratorSingletonServiceTest > 
> testLeadAdvertise FAILED
> java.lang.AssertionError at CuratorSingletonServiceTest.java:94
> java.lang.AssertionError
> ```
> 
> I looked at the testLeadAdvertise one 1st which [blocks until an 
> ephemeral node is 
> added](https://github.com/apache/aurora/blob/master/src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java#L82-L83)
>  to the leadership group path as the signal that leadership has been already 
> synchronously obtained. The assumption is broken with LeaderSelector since 
> the transition is [fired on an executor 
> runnable](https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderSelector.java#L239)
>  and could be delayed arbitrarily past the test assertion.
> 
> In short, you're getting lucky having these 2 (and maybe more) pass on 
> your machine.  Needs careful re-review of the test infra and the new behavior 
> of the LeaderSelector using the thread pool to do leadership transitions 
> async.

Good catch.

The tests already have an `awaitCapture` and since I am adding a timeout, we 
can use that to ensure the listeners are fired instead of plain asserts.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54288/#review158601
---


On Dec. 8, 2016, 3:28 p.m., Zameer Manji wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54288/
> ---
> 
> (Updated Dec. 8, 2016, 3:28 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, John Sirois, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1669
> https://issues.apache.org/jira/browse/AURORA-1669
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
> leadership if the ZK connection is lost or if there is a timeout. This is not
> compatible with the commons based implementation which would only abdicate
> leadership if the ZK session timeout occurred.
> 
> This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
> custom listener that only loses leadership if a connection loss occurs.
> 
> 
> Diffs
> -
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  50acaeba82e163f8f2970a264cbd889c9eb3b5ed 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  c378172c850aafe0a9381552b5067277b40dbfab 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  a2b4125369d1f6c0a79bc4ac0fb3d2dab8a6c583 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  6ea49b0c690d288ff59d1d4798144bfa2d153d3a 
> 
> Diff: https://reviews.apache.org/r/54288/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>