Re: Review Request 29827: End to end tests for docker in aurora

2015-01-28 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29827/#review70026
---


I am getting an error when running e2e tests with this patch:
```
+ vagrant ssh -c 'cd /vagrant/src/test/sh/org/apache/aurora/e2e  docker build 
-t http_example .'
Sending build context to Docker daemon 
FATA[] Post http:///var/run/docker.sock/v1.16/build?rm=1t=http_example: 
dial unix /var/run/docker.sock: permission denied 
Connection to 127.0.0.1 closed.
+ collect_result
+ [[ 1 = 0 ]]
+ echo '!!!'
!!!
+ echo 'FAIL (something returned non-zero)'
FAIL (something returned non-zero)
+ echo ''

```
Anything I am missing?

- Maxim Khutornenko


On Jan. 27, 2015, 4:58 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29827/
 ---
 
 (Updated Jan. 27, 2015, 4:58 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This adds an end to end test for docker.
 
 
 Diffs
 -
 
   src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
 45da754341de52759d05a8960a9a978111f1e415 
 
 Diff: https://reviews.apache.org/r/29827/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30187: Remove support for cluster metadata in YAML format.

2015-01-28 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30187/#review70079
---

Ship it!


Master (5902574) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing @ReviewBot 
retry

- Aurora ReviewBot


On Jan. 28, 2015, 8:26 p.m., Bill Farner wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30187/
 ---
 
 (Updated Jan. 28, 2015, 8:26 p.m.)
 
 
 Review request for Aurora, Brian Wickman and Zameer Manji.
 
 
 Bugs: AURORA-1029
 https://issues.apache.org/jira/browse/AURORA-1029
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Remove support for cluster metadata in YAML format.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/common/clusters.py 
 e55aa774b4b868f696a7de51bb016f950871dd1e 
   src/test/python/apache/aurora/common/BUILD 
 14165b96be99b8de418f4bb8def9f27eaf29e67d 
   src/test/python/apache/aurora/common/test_clusters.py 
 45250e609cca1149dc296b2aaf645ff2f58f8288 
 
 Diff: https://reviews.apache.org/r/30187/diff/
 
 
 Testing
 ---
 
 ./build-support/jenkins/build.sh
 
 test_end_to_end.sh is currently broken on master, i will address that and 
 ensure it passes before committing this.
 
 
 Thanks,
 
 Bill Farner
 




Re: Review Request 29827: End to end tests for docker in aurora

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29827/#review70082
---

Ship it!


Ship It!

- Bill Farner


On Jan. 28, 2015, 6:56 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29827/
 ---
 
 (Updated Jan. 28, 2015, 6:56 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This adds an end to end test for docker.
 
 
 Diffs
 -
 
   src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
 45da754341de52759d05a8960a9a978111f1e415 
 
 Diff: https://reviews.apache.org/r/29827/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30187: Remove support for cluster metadata in YAML format.

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30187/
---

(Updated Jan. 28, 2015, 8:26 p.m.)


Review request for Aurora, Brian Wickman and Zameer Manji.


Bugs: AURORA-1029
https://issues.apache.org/jira/browse/AURORA-1029


Repository: aurora


Description
---

Remove support for cluster metadata in YAML format.


Diffs (updated)
-

  src/main/python/apache/aurora/common/clusters.py 
e55aa774b4b868f696a7de51bb016f950871dd1e 
  src/test/python/apache/aurora/common/BUILD 
14165b96be99b8de418f4bb8def9f27eaf29e67d 
  src/test/python/apache/aurora/common/test_clusters.py 
45250e609cca1149dc296b2aaf645ff2f58f8288 

Diff: https://reviews.apache.org/r/30187/diff/


Testing
---

./build-support/jenkins/build.sh

test_end_to_end.sh is currently broken on master, i will address that and 
ensure it passes before committing this.


Thanks,

Bill Farner



Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70076
---

Ship it!


Looks reasonable to me (caveat being I'm not super familiar with the internals 
of the scheduler driven updates).

- Joshua Cohen


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 30187: Remove support for cluster metadata in YAML format.

2015-01-28 Thread Bill Farner


 On Jan. 26, 2015, 8:16 p.m., Kevin Sweeney wrote:
  src/main/python/apache/aurora/common/clusters.py, line 42
  https://reviews.apache.org/r/30187/diff/1/?file=830286#file830286line42
 
  For the purposes of sheparding this review along would you consider 
  moving this to another change?
 
 Bill Farner wrote:
 Can you give more detail on the reasoning?  While i generally agree with 
 keeping logically-different changes separate, i don't think we should be 
 strictly opposed to cleaning up code in the immediate vicinity of a patch.
 
 Kevin Sweeney wrote:
 The patch that removes the style-checker annotations around this set of 
 files will need to be accompanied by a change to the checkstyle tool to 
 ignore or detect these violations. IMO that's big enough for a separate patch.

Thanks.  I had convinced myself that i ran checkstyle and it did not complain.  
Reverting, as it is indeed necessary.


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30187/#review69663
---


On Jan. 22, 2015, 9:09 p.m., Bill Farner wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30187/
 ---
 
 (Updated Jan. 22, 2015, 9:09 p.m.)
 
 
 Review request for Aurora, Brian Wickman and Zameer Manji.
 
 
 Bugs: AURORA-1029
 https://issues.apache.org/jira/browse/AURORA-1029
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Remove support for cluster metadata in YAML format.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/common/clusters.py 
 e55aa774b4b868f696a7de51bb016f950871dd1e 
   src/test/python/apache/aurora/common/BUILD 
 14165b96be99b8de418f4bb8def9f27eaf29e67d 
   src/test/python/apache/aurora/common/test_clusters.py 
 45250e609cca1149dc296b2aaf645ff2f58f8288 
 
 Diff: https://reviews.apache.org/r/30187/diff/
 
 
 Testing
 ---
 
 ./build-support/jenkins/build.sh
 
 test_end_to_end.sh is currently broken on master, i will address that and 
 ensure it passes before committing this.
 
 
 Thanks,
 
 Bill Farner
 




Re: Review Request 30187: Remove support for cluster metadata in YAML format.

2015-01-28 Thread Bill Farner


 On Jan. 26, 2015, 8:40 p.m., Brian Wickman wrote:
  src/main/python/apache/aurora/common/clusters.py, line 42
  https://reviews.apache.org/r/30187/diff/1/?file=830286#file830286line42
 
  wait -- the # noqa is necessary, otherwise checkstyle will fail.

Reverted


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30187/#review69668
---


On Jan. 22, 2015, 9:09 p.m., Bill Farner wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30187/
 ---
 
 (Updated Jan. 22, 2015, 9:09 p.m.)
 
 
 Review request for Aurora, Brian Wickman and Zameer Manji.
 
 
 Bugs: AURORA-1029
 https://issues.apache.org/jira/browse/AURORA-1029
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Remove support for cluster metadata in YAML format.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/common/clusters.py 
 e55aa774b4b868f696a7de51bb016f950871dd1e 
   src/test/python/apache/aurora/common/BUILD 
 14165b96be99b8de418f4bb8def9f27eaf29e67d 
   src/test/python/apache/aurora/common/test_clusters.py 
 45250e609cca1149dc296b2aaf645ff2f58f8288 
 
 Diff: https://reviews.apache.org/r/30187/diff/
 
 
 Testing
 ---
 
 ./build-support/jenkins/build.sh
 
 test_end_to_end.sh is currently broken on master, i will address that and 
 ensure it passes before committing this.
 
 
 Thanks,
 
 Bill Farner
 




Re: Review Request 30384: Added TellApart to list of aurora users.

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30384/#review70107
---

Ship it!



README.md
https://reviews.apache.org/r/30384/#comment115092

 http

One of these things is not like the others :-)


- Bill Farner


On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30384/
 ---
 
 (Updated Jan. 28, 2015, 11:03 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added TellApart to list of aurora users.
 
 
 Diffs
 -
 
   README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab 
 
 Diff: https://reviews.apache.org/r/30384/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30384: Added TellApart to list of aurora users.

2015-01-28 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30384/#review70108
---

Ship it!


Master (51f0d8b) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing @ReviewBot 
retry

- Aurora ReviewBot


On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30384/
 ---
 
 (Updated Jan. 28, 2015, 11:03 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added TellApart to list of aurora users.
 
 
 Diffs
 -
 
   README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab 
 
 Diff: https://reviews.apache.org/r/30384/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30384: Added TellApart to list of aurora users.

2015-01-28 Thread Steve Niemitz

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30384/
---

(Updated Jan. 28, 2015, 11:38 p.m.)


Review request for Aurora and Bill Farner.


Changes
---

So many more Ss.


Repository: aurora


Description
---

Added TellApart to list of aurora users.


Diffs (updated)
-

  README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab 

Diff: https://reviews.apache.org/r/30384/diff/


Testing
---


Thanks,

Steve Niemitz



Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread Maxim Khutornenko


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
 
 David McLaughlin wrote:
 How do we show to the user (via client output or UI) that the update is 
 currently blocked?

One possible way is described here: 
https://issues.apache.org/jira/browse/AURORA-1049


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 30383: Fix small typo in documentation.

2015-01-28 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30383/#review70106
---

Ship it!


Ship It!

- Zameer Manji


On Jan. 28, 2015, 3:20 p.m., Florian Pfeiffer wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30383/
 ---
 
 (Updated Jan. 28, 2015, 3:20 p.m.)
 
 
 Review request for Aurora and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Fix small typo in documentation.
 In the examples it's always db_team, so I think it should be db_team in 
 this sentence as well
 
 
 Diffs
 -
 
   docs/deploying-aurora-scheduler.md c14f865a844b6eb34cc0e9363dba50d4975f5632 
 
 Diff: https://reviews.apache.org/r/30383/diff/
 
 
 Testing
 ---
 
 none
 
 
 Thanks,
 
 Florian Pfeiffer
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md

How do we show to the user (via client output or UI) that the update is 
currently blocked?


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
 
 David McLaughlin wrote:
 How do we show to the user (via client output or UI) that the update is 
 currently blocked?
 
 Maxim Khutornenko wrote:
 One possible way is described here: 
 https://issues.apache.org/jira/browse/AURORA-1049
 
 David McLaughlin wrote:
 I don't think this is sufficient. We'd need auditing to explain to users 
 why an update was paused (blocked) for a given time, not just the current 
 status.
 
 Maxim Khutornenko wrote:
 That would require persistance and changing the actual status of the job. 
 I'd rather not introduce a new state that would only be applicable to 
 specific update configurations. The more important here is the visibility 
 into the internal transient state to troubleshoot a coordinated job unable 
 to make progress.

I don't follow the line of reasoning that 100% of updates have to use a feature 
for us to have a state to reflect it. I think in terms of simplicity and 
consistency, it makes way more sense to have an explicit 
UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved 
to when blocked. A pulse could then have one valid state transition that would 
require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. 

The drawbacks of this compared to this approach I think would be performance in 
the case of the external monitoring service flapping, or in the case of a 
scheduler failover. To combat this, we could also add some kind of initial 
grace period by setting a base timestamp at the point of leader election. This 
timestamp could be used in the should this be blocked? calculation when no 
previous pulse exists in the on-heap data structure.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: 

Re: Review Request 30384: Added TellApart to list of aurora users.

2015-01-28 Thread Steve Niemitz


 On Jan. 28, 2015, 11:31 p.m., Bill Farner wrote:
  README.md, line 65
  https://reviews.apache.org/r/30384/diff/1/?file=839312#file839312line65
 
   http
  
  One of these things is not like the others :-)

Our https site is just a 301 to http.  I can keep it https if you want though. 
:)


- Steve


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30384/#review70107
---


On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30384/
 ---
 
 (Updated Jan. 28, 2015, 11:03 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added TellApart to list of aurora users.
 
 
 Diffs
 -
 
   README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab 
 
 Diff: https://reviews.apache.org/r/30384/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread Maxim Khutornenko


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.

An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new 
pulse. This covers a few important design desisions:
- An update can be created blocked by default awaiting for the first pulse to 
start its progress;
- An occasional network partition/delay will not require an explicit external 
service operation to resume;
- A scheduler restart is treated the same as initial update creation - an 
update is rehydrated and waits for a pulse to resume;

More details and scenarios here: 
https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
 
 David McLaughlin wrote:
 How do we show to the user (via client output or UI) that the update is 
 currently blocked?
 
 Maxim Khutornenko wrote:
 One possible way is described here: 
 https://issues.apache.org/jira/browse/AURORA-1049

I don't think this is sufficient. We'd need auditing to explain to users why an 
update was paused (blocked) for a given time, not just the current status.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 29286: Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.

2015-01-28 Thread Brian Wickman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29286/#review70132
---


What specifically is taking place here?  (I think I understand some of it, but 
I'm not sure what was broken or what is being improved.)

- Brian Wickman


On Dec. 20, 2014, 3:05 a.m., Kevin Sweeney wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29286/
 ---
 
 (Updated Dec. 20, 2014, 3:05 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/admin/BUILD 
 f874264bdf07a9cbb2f0990739be3c95f851b040 
   src/main/python/apache/aurora/admin/host_maintenance.py 
 8fa5182fae509493def7175635fbe1cb79fa3eec 
   src/main/python/apache/aurora/client/api/BUILD 
 65e5a85e23c4c698356c8b45c45943e560c1bcd5 
   src/main/python/apache/aurora/client/api/command_runner.py 
 48cb567c2098620e0ee322fe9528e167ce7c7c62 
   src/main/python/apache/aurora/client/api/disambiguator.py 
 6a78ccd44533ef327f751a08c9e2e16555354d97 
   src/test/python/apache/aurora/admin/BUILD 
 3a216809d1e31247f7d01451fcc7fd877a4c1fb2 
   src/test/python/apache/aurora/client/api/BUILD 
 2c0c4070cc1f1784b1d4e7f9cd8aac236e97be75 
   src/test/python/apache/aurora/client/api/test_task_util.py 
 048aff6874259810efea463df1ca2a1fdc419ca1 
 
 Diff: https://reviews.apache.org/r/29286/diff/
 
 
 Testing
 ---
 
 ./pants src/test/python:all
 
 
 Thanks,
 
 Kevin Sweeney
 




Re: Review Request 29286: Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29286/#review70093
---


Is this patch still relevant?

- Bill Farner


On Dec. 20, 2014, 3:05 a.m., Kevin Sweeney wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29286/
 ---
 
 (Updated Dec. 20, 2014, 3:05 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/admin/BUILD 
 f874264bdf07a9cbb2f0990739be3c95f851b040 
   src/main/python/apache/aurora/admin/host_maintenance.py 
 8fa5182fae509493def7175635fbe1cb79fa3eec 
   src/main/python/apache/aurora/client/api/BUILD 
 65e5a85e23c4c698356c8b45c45943e560c1bcd5 
   src/main/python/apache/aurora/client/api/command_runner.py 
 48cb567c2098620e0ee322fe9528e167ce7c7c62 
   src/main/python/apache/aurora/client/api/disambiguator.py 
 6a78ccd44533ef327f751a08c9e2e16555354d97 
   src/test/python/apache/aurora/admin/BUILD 
 3a216809d1e31247f7d01451fcc7fd877a4c1fb2 
   src/test/python/apache/aurora/client/api/BUILD 
 2c0c4070cc1f1784b1d4e7f9cd8aac236e97be75 
   src/test/python/apache/aurora/client/api/test_task_util.py 
 048aff6874259810efea463df1ca2a1fdc419ca1 
 
 Diff: https://reviews.apache.org/r/29286/diff/
 
 
 Testing
 ---
 
 ./pants src/test/python:all
 
 
 Thanks,
 
 Kevin Sweeney
 




Re: Review Request 28731: Implemented TaskScheduler benchmarks.

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28731/#review70095
---

Ship it!


LGTM.  I suspect we will learn more about how this scaffolding should look as 
we start submitting perf improvement patches with test cases here.

- Bill Farner


On Jan. 22, 2015, 1:54 a.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/28731/
 ---
 
 (Updated Jan. 22, 2015, 1:54 a.m.)
 
 
 Review request for Aurora, Kevin Sweeney and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added baseline benchmarks for a few static veto cases.
 
 
 Diffs
 -
 
   build.gradle f9f71a84493b782e9f6072e44e89a2c017cf2a09 
   src/jmh/java/org/apache/aurora/benchmark/Hosts.java PRE-CREATION 
   src/jmh/java/org/apache/aurora/benchmark/Offers.java PRE-CREATION 
   src/jmh/java/org/apache/aurora/benchmark/SchedulerBenchmark.java 
 5cecada93e4e04b689e826af49f691ed7e94ae49 
   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
 PRE-CREATION 
   src/jmh/java/org/apache/aurora/benchmark/Tasks.java PRE-CREATION 
   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java PRE-CREATION 
   
 src/jmh/java/org/apache/aurora/benchmark/fakes/FakeRescheduleCalculator.java 
 PRE-CREATION 
   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeStatsProvider.java 
 PRE-CREATION 
   src/main/java/org/apache/aurora/scheduler/async/TaskScheduler.java 
 b6402ae42e3c7e4dca1c120fa6ef82d2d69e69d5 
   
 src/main/java/org/apache/aurora/scheduler/async/preemptor/CachedClusterState.java
  03c2a8fde4a3fe5ac73f44da2cbe227995501c46 
   src/main/java/org/apache/aurora/scheduler/async/preemptor/ClusterState.java 
 f7e157c890b5627411acd4bd5c2559ef4829147c 
   
 src/main/java/org/apache/aurora/scheduler/async/preemptor/PreemptorImpl.java 
 0204d14b19ae412236f19ca274d81decb4eba12d 
 
 Diff: https://reviews.apache.org/r/28731/diff/
 
 
 Testing
 ---
 
 Sample run on a local box:
 ```
 # VM invoker: 
 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/jre/bin/java
 # VM options: -Dfile.encoding=UTF-8 -Duser.country=US -Duser.language=en 
 -Duser.variant
 # Warmup: 10 iterations, 1 s each
 # Measurement: 100 iterations, 1 s each
 # Timeout: 10 min per iteration
 # Threads: 1 thread, will synchronize iterations
 # Benchmark mode: Average time, time/op
 # Benchmark: 
 org.apache.aurora.benchmark.SchedulingBenchmarks.ConstraintMismatchsSchedulingBenchmark.runBenchmark
 
 # Run progress: 0.00% complete, ETA 00:05:30
 # Fork: 1 of 1
 # Warmup Iteration   1: 278284250.000 ns/op
 # Warmup Iteration   2: 70294312.500 ns/op
 # Warmup Iteration   3: 19293379.310 ns/op
 # Warmup Iteration   4: 11945387.097 ns/op
 # Warmup Iteration   5: 10725388.350 ns/op
 # Warmup Iteration   6: 13043282.353 ns/op
 # Warmup Iteration   7: 9233083.333 ns/op
 # Warmup Iteration   8: 9521051.724 ns/op
 # Warmup Iteration   9: 10750854.369 ns/op
 # Warmup Iteration  10: 7460243.243 ns/op
 Iteration   1: 7885364.286 ns/op
 Iteration   2: 7735139.860 ns/op
 Iteration   3: 7660208.333 ns/op
 Iteration   4: 7761204.225 ns/op
 Iteration   5: 7868907.143 ns/op
 Iteration   6: 7567404.110 ns/op
 Iteration   7: 7611000.000 ns/op
 Iteration   8: 7766154.930 ns/op
 Iteration   9: 7669344.828 ns/op
 Iteration  10: 7707783.217 ns/op
 Iteration  11: 7435651.007 ns/op
 Iteration  12: 7697631.944 ns/op
 Iteration  13: 7712531.469 ns/op
 Iteration  14: 7899407.143 ns/op
 Iteration  15: 7448472.973 ns/op
 Iteration  16: 7791521.127 ns/op
 Iteration  17: 7612213.793 ns/op
 Iteration  18: 7710867.133 ns/op
 Iteration  19: 7649296.552 ns/op
 Iteration  20: 7768309.859 ns/op
 Iteration  21: 7688666.667 ns/op
 Iteration  22: 7531557.823 ns/op
 Iteration  23: 7381193.333 ns/op
 Iteration  24: 7726006.993 ns/op
 Iteration  25: 7603358.621 ns/op
 Iteration  26: 7653631.944 ns/op
 Iteration  27: 7442275.168 ns/op
 Iteration  28: 7613186.207 ns/op
 Iteration  29: 7765823.944 ns/op
 Iteration  30: 7489687.075 ns/op
 Iteration  31: 7811443.662 ns/op
 Iteration  32: 8015007.246 ns/op
 Iteration  33: 8192392.593 ns/op
 Iteration  34: 8040335.766 ns/op
 Iteration  35: 7584212.329 ns/op
 Iteration  36: 8001934.783 ns/op
 Iteration  37: 9744815.789 ns/op
 Iteration  38: 11688284.211 ns/op
 Iteration  39: 8661406.250 ns/op
 Iteration  40: 7678413.793 ns/op
 Iteration  41: 8502223.077 ns/op
 Iteration  42: 7640820.690 ns/op
 Iteration  43: 7875624.113 ns/op
 Iteration  44: 7506809.524 ns/op
 Iteration  45: 8005431.655 ns/op
 Iteration  46: 8081664.234 ns/op
 Iteration  47: 7579438.356 ns/op
 Iteration  48: 7993405.797 ns/op
 Iteration  49: 7571958.904 ns/op
 Iteration  50: 8116463.235 ns/op
 Iteration  51: 7941330.935 ns/op
 Iteration  52: 

Review Request 30384: Added TellApart to list of aurora users.

2015-01-28 Thread Steve Niemitz

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30384/
---

Review request for Aurora and Bill Farner.


Repository: aurora


Description
---

Added TellApart to list of aurora users.


Diffs
-

  README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab 

Diff: https://reviews.apache.org/r/30384/diff/


Testing
---


Thanks,

Steve Niemitz



Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
 
 David McLaughlin wrote:
 How do we show to the user (via client output or UI) that the update is 
 currently blocked?
 
 Maxim Khutornenko wrote:
 One possible way is described here: 
 https://issues.apache.org/jira/browse/AURORA-1049
 
 David McLaughlin wrote:
 I don't think this is sufficient. We'd need auditing to explain to users 
 why an update was paused (blocked) for a given time, not just the current 
 status.
 
 Maxim Khutornenko wrote:
 That would require persistance and changing the actual status of the job. 
 I'd rather not introduce a new state that would only be applicable to 
 specific update configurations. The more important here is the visibility 
 into the internal transient state to troubleshoot a coordinated job unable 
 to make progress.
 
 David McLaughlin wrote:
 I don't follow the line of reasoning that 100% of updates have to use a 
 feature for us to have a state to reflect it. I think in terms of simplicity 
 and consistency, it makes way more sense to have an explicit 
 UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and 
 moved to when blocked. A pulse could then have one valid state transition 
 that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. 
 
 The drawbacks of this compared to this approach I think would be 
 performance in the case of the external monitoring service flapping, or in 
 the case of a scheduler failover. To combat this, we could also add some kind 
 of initial grace period by setting a base timestamp at the point of leader 
 election. This timestamp could be used in the should this be blocked? 
 calculation when no previous pulse exists in the on-heap data structure.
 
 Maxim Khutornenko wrote:
 The consensus we reached on the dev list [1] is to implement the first 
 version of the feature without any persistence at all. We agreed to look into 
 reacher status reporting later as/if needed. 
 
 IMO, adding a new state is a more involved solution. We would need to 
 revisit the update state graph, figure out and validate the new rules for 
 user pause/resume actions wrt the new state and pulse. Not impossible but 
 hardly a blocking requirement either.
 
 [1] - 
 http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser

On that email thread, my last comment is to bring up this exact point :)

I'm -1 on proceeding without auditing for blocked state. But if the majority 
disagrees, then that's fine and I'm fine for this to proceed with another ship 
it.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal 

Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread Maxim Khutornenko


 On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   lines 259-263
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259
 
  I am unsure why this is being called inside pulse. Once pulse is 
  activated, only the absence of a pulse can modify the update, right? We 
  don't resume a paused update by receiving a pulse. 
  
  So surely the last pulse time would be checked externally to the method 
  that performs the pulse? 
  
  If we can remove this, you can get rid of the write lock completely 
  here, because all you need are strongly consistent reads (which we have) to 
  accurately update the cooridinatedUpdateStates map correctly.
 
 Maxim Khutornenko wrote:
 An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
 a new pulse. This covers a few important design desisions:
 - An update can be created blocked by default awaiting for the first 
 pulse to start its progress;
 - An occasional network partition/delay will not require an explicit 
 external service operation to resume;
 - A scheduler restart is treated the same as initial update creation - an 
 update is rehydrated and waits for a pulse to resume;
 
 More details and scenarios here: 
 https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
 
 David McLaughlin wrote:
 How do we show to the user (via client output or UI) that the update is 
 currently blocked?
 
 Maxim Khutornenko wrote:
 One possible way is described here: 
 https://issues.apache.org/jira/browse/AURORA-1049
 
 David McLaughlin wrote:
 I don't think this is sufficient. We'd need auditing to explain to users 
 why an update was paused (blocked) for a given time, not just the current 
 status.
 
 Maxim Khutornenko wrote:
 That would require persistance and changing the actual status of the job. 
 I'd rather not introduce a new state that would only be applicable to 
 specific update configurations. The more important here is the visibility 
 into the internal transient state to troubleshoot a coordinated job unable 
 to make progress.
 
 David McLaughlin wrote:
 I don't follow the line of reasoning that 100% of updates have to use a 
 feature for us to have a state to reflect it. I think in terms of simplicity 
 and consistency, it makes way more sense to have an explicit 
 UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and 
 moved to when blocked. A pulse could then have one valid state transition 
 that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. 
 
 The drawbacks of this compared to this approach I think would be 
 performance in the case of the external monitoring service flapping, or in 
 the case of a scheduler failover. To combat this, we could also add some kind 
 of initial grace period by setting a base timestamp at the point of leader 
 election. This timestamp could be used in the should this be blocked? 
 calculation when no previous pulse exists in the on-heap data structure.

The consensus we reached on the dev list [1] is to implement the first version 
of the feature without any persistence at all. We agreed to look into reacher 
status reporting later as/if needed. 

IMO, adding a new state is a more involved solution. We would need to revisit 
the update state graph, figure out and validate the new rules for user 
pause/resume actions wrt the new state and pulse. Not impossible but hardly a 
blocking requirement either.

[1] - 
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   

Re: Review Request 30207: Simplify AuroraCommandContext

2015-01-28 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30207/#review70043
---


Bill, can I get a ship it here?

- Zameer Manji


On Jan. 22, 2015, 7:32 p.m., Zameer Manji wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30207/
 ---
 
 (Updated Jan. 22, 2015, 7:32 p.m.)
 
 
 Review request for Aurora, Maxim Khutornenko and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The AuroraCommandContext class is used in multiple commands and contains 
 common code for all of them. However some portions are only used by one 
 command. This patch takes some of those portions and moves them to the 
 command that requires that functionality.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/client/cli/context.py 
 51c7d24dca664e476e62f1864d095416dfab70e4 
   src/main/python/apache/aurora/client/cli/jobs.py 
 8f349c09637c16e2499e85f2dc96eb7ccffd0aaf 
   src/main/python/apache/aurora/client/cli/update.py PRE-CREATION 
   src/test/python/apache/aurora/client/cli/test_supdate.py PRE-CREATION 
   src/test/python/apache/aurora/client/cli/test_update.py 
 8b7d11202b35deb09a248cfe0a96458fb70c 
 
 Diff: https://reviews.apache.org/r/30207/diff/
 
 
 Testing
 ---
 
 ./pants test.pytest --no-fast src/test/python/apache/aurora/client::
 
 
 Thanks,
 
 Zameer Manji
 




Re: Review Request 30207: Simplify AuroraCommandContext

2015-01-28 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30207/#review70046
---

Ship it!


Ship It!

- Bill Farner


On Jan. 23, 2015, 3:32 a.m., Zameer Manji wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30207/
 ---
 
 (Updated Jan. 23, 2015, 3:32 a.m.)
 
 
 Review request for Aurora, Maxim Khutornenko and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The AuroraCommandContext class is used in multiple commands and contains 
 common code for all of them. However some portions are only used by one 
 command. This patch takes some of those portions and moves them to the 
 command that requires that functionality.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/client/cli/context.py 
 51c7d24dca664e476e62f1864d095416dfab70e4 
   src/main/python/apache/aurora/client/cli/jobs.py 
 8f349c09637c16e2499e85f2dc96eb7ccffd0aaf 
   src/main/python/apache/aurora/client/cli/update.py PRE-CREATION 
   src/test/python/apache/aurora/client/cli/test_supdate.py PRE-CREATION 
   src/test/python/apache/aurora/client/cli/test_update.py 
 8b7d11202b35deb09a248cfe0a96458fb70c 
 
 Diff: https://reviews.apache.org/r/30207/diff/
 
 
 Testing
 ---
 
 ./pants test.pytest --no-fast src/test/python/apache/aurora/client::
 
 
 Thanks,
 
 Zameer Manji
 




Re: Review Request 29827: End to end tests for docker in aurora

2015-01-28 Thread Steve Niemitz

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29827/
---

(Updated Jan. 28, 2015, 6:56 p.m.)


Review request for Aurora and Bill Farner.


Repository: aurora


Description
---

This adds an end to end test for docker.


Diffs (updated)
-

  src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora 
PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora 
PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
45da754341de52759d05a8960a9a978111f1e415 

Diff: https://reviews.apache.org/r/29827/diff/


Testing
---


Thanks,

Steve Niemitz



Re: Review Request 29827: End to end tests for docker in aurora

2015-01-28 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29827/#review70055
---

Ship it!


Master (46a0466) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing @ReviewBot 
retry

- Aurora ReviewBot


On Jan. 28, 2015, 6:56 p.m., Steve Niemitz wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/29827/
 ---
 
 (Updated Jan. 28, 2015, 6:56 p.m.)
 
 
 Review request for Aurora and Bill Farner.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This adds an end to end test for docker.
 
 
 Diffs
 -
 
   src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora 
 PRE-CREATION 
   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
 45da754341de52759d05a8960a9a978111f1e415 
 
 Diff: https://reviews.apache.org/r/29827/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steve Niemitz
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin


 On Jan. 27, 2015, 5:49 p.m., Joshua Cohen wrote:
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
   line 227
  https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line227
 
  Given how sensitive we are to storage lock contention, is there any 
  work in this block that can be done outside of the lock? From a casual 
  glance the only potentially time-consuming (non-write) operation is the 
  fetch from the job update store, but I'm not sure how intensive an 
  operation that will be (or the impact of performing it outside of the 
  context of a lock).
 
 Maxim Khutornenko wrote:
 Moving a fetch call out of the write transaction would create a potential 
 for a race that may compromise the feature and create hard to reason/trace 
 corner cases. This is not worth the arguable gain in contention improvement 
 in my opinion.
 
 David McLaughlin wrote:
 AFAICT the worst race condition is sending a heartbeat for a job that is 
 already complete. Are there other race conditions we are protecting against? 
 
 Also, I don't see the value in persisting the heartbeats. What is the 
 rationale for this?
 
 Maxim Khutornenko wrote:
 AFAICT the worst race condition is sending a heartbeat for a job that is 
 already complete. Are there other race conditions we are protecting against? 
 - anything inside the evaluation logic is sensitive to the data provided in 
 the details pull. This ranges from an invalid OK request sent back for a 
 completed update to an internal updater failure due to outdated expectations. 
 This pattern of querying inside of a write transaction is common in the 
 updater (e.g. look at changeJobUpdateStatus() called any time an update is 
 evaluated) and I'd rather not second-guess the correctness of the split.
 
 Also, I don't see the value in persisting the heartbeats. What is the 
 rationale for this? - what makes you think they are persisted? The 
 heartbeats are stored in-memory only (look for coordinatedUpdateStates map).

Thanks for explaining, the presence of write lock threw me off and I misread 
the code.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review69834
---


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko
 




Re: Review Request 30225: Modifying update controller to support heartbeats.

2015-01-28 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
---



src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
https://reviews.apache.org/r/30225/#comment115014

I am unsure why this is being called inside pulse. Once pulse is activated, 
only the absence of a pulse can modify the update, right? We don't resume a 
paused update by receiving a pulse. 

So surely the last pulse time would be checked externally to the method 
that performs the pulse? 

If we can remove this, you can get rid of the write lock completely here, 
because all you need are strongly consistent reads (which we have) to 
accurately update the cooridinatedUpdateStates map correctly.


- David McLaughlin


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30225/
 ---
 
 (Updated Jan. 23, 2015, 8:37 p.m.)
 
 
 Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
 
 
 Bugs: AURORA-1010
 https://issues.apache.org/jira/browse/AURORA-1010
 
 
 Repository: aurora
 
 
 Description
 ---
 
 Added pulsing support into the JobUpdateController. The qualified coordinated 
 updates get blocked until a pulse arrives. An update then becomes active and 
 proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
 terminal state (whichever comes first).
 
 Not particularly happy with plumbing through OneWayJobUpdater but the 
 alternative is a state machine change, which is much hairier and will require 
 more changes in the JobUpdaterController. Going with the minimal diff here.
 
 
 Diffs
 -
 
   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
 d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
   
 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
 
 Diff: https://reviews.apache.org/r/30225/diff/
 
 
 Testing
 ---
 
 ./gradlew -Pq build
 
 
 Thanks,
 
 Maxim Khutornenko