Re: Review Request 29827: End to end tests for docker in aurora
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/#review70026 --- I am getting an error when running e2e tests with this patch: ``` + vagrant ssh -c 'cd /vagrant/src/test/sh/org/apache/aurora/e2e docker build -t http_example .' Sending build context to Docker daemon FATA[] Post http:///var/run/docker.sock/v1.16/build?rm=1t=http_example: dial unix /var/run/docker.sock: permission denied Connection to 127.0.0.1 closed. + collect_result + [[ 1 = 0 ]] + echo '!!!' !!! + echo 'FAIL (something returned non-zero)' FAIL (something returned non-zero) + echo '' ``` Anything I am missing? - Maxim Khutornenko On Jan. 27, 2015, 4:58 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/ --- (Updated Jan. 27, 2015, 4:58 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- This adds an end to end test for docker. Diffs - src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 45da754341de52759d05a8960a9a978111f1e415 Diff: https://reviews.apache.org/r/29827/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30187: Remove support for cluster metadata in YAML format.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/#review70079 --- Ship it! Master (5902574) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Jan. 28, 2015, 8:26 p.m., Bill Farner wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/ --- (Updated Jan. 28, 2015, 8:26 p.m.) Review request for Aurora, Brian Wickman and Zameer Manji. Bugs: AURORA-1029 https://issues.apache.org/jira/browse/AURORA-1029 Repository: aurora Description --- Remove support for cluster metadata in YAML format. Diffs - src/main/python/apache/aurora/common/clusters.py e55aa774b4b868f696a7de51bb016f950871dd1e src/test/python/apache/aurora/common/BUILD 14165b96be99b8de418f4bb8def9f27eaf29e67d src/test/python/apache/aurora/common/test_clusters.py 45250e609cca1149dc296b2aaf645ff2f58f8288 Diff: https://reviews.apache.org/r/30187/diff/ Testing --- ./build-support/jenkins/build.sh test_end_to_end.sh is currently broken on master, i will address that and ensure it passes before committing this. Thanks, Bill Farner
Re: Review Request 29827: End to end tests for docker in aurora
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/#review70082 --- Ship it! Ship It! - Bill Farner On Jan. 28, 2015, 6:56 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/ --- (Updated Jan. 28, 2015, 6:56 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- This adds an end to end test for docker. Diffs - src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 45da754341de52759d05a8960a9a978111f1e415 Diff: https://reviews.apache.org/r/29827/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30187: Remove support for cluster metadata in YAML format.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/ --- (Updated Jan. 28, 2015, 8:26 p.m.) Review request for Aurora, Brian Wickman and Zameer Manji. Bugs: AURORA-1029 https://issues.apache.org/jira/browse/AURORA-1029 Repository: aurora Description --- Remove support for cluster metadata in YAML format. Diffs (updated) - src/main/python/apache/aurora/common/clusters.py e55aa774b4b868f696a7de51bb016f950871dd1e src/test/python/apache/aurora/common/BUILD 14165b96be99b8de418f4bb8def9f27eaf29e67d src/test/python/apache/aurora/common/test_clusters.py 45250e609cca1149dc296b2aaf645ff2f58f8288 Diff: https://reviews.apache.org/r/30187/diff/ Testing --- ./build-support/jenkins/build.sh test_end_to_end.sh is currently broken on master, i will address that and ensure it passes before committing this. Thanks, Bill Farner
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70076 --- Ship it! Looks reasonable to me (caveat being I'm not super familiar with the internals of the scheduler driven updates). - Joshua Cohen On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30187: Remove support for cluster metadata in YAML format.
On Jan. 26, 2015, 8:16 p.m., Kevin Sweeney wrote: src/main/python/apache/aurora/common/clusters.py, line 42 https://reviews.apache.org/r/30187/diff/1/?file=830286#file830286line42 For the purposes of sheparding this review along would you consider moving this to another change? Bill Farner wrote: Can you give more detail on the reasoning? While i generally agree with keeping logically-different changes separate, i don't think we should be strictly opposed to cleaning up code in the immediate vicinity of a patch. Kevin Sweeney wrote: The patch that removes the style-checker annotations around this set of files will need to be accompanied by a change to the checkstyle tool to ignore or detect these violations. IMO that's big enough for a separate patch. Thanks. I had convinced myself that i ran checkstyle and it did not complain. Reverting, as it is indeed necessary. - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/#review69663 --- On Jan. 22, 2015, 9:09 p.m., Bill Farner wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/ --- (Updated Jan. 22, 2015, 9:09 p.m.) Review request for Aurora, Brian Wickman and Zameer Manji. Bugs: AURORA-1029 https://issues.apache.org/jira/browse/AURORA-1029 Repository: aurora Description --- Remove support for cluster metadata in YAML format. Diffs - src/main/python/apache/aurora/common/clusters.py e55aa774b4b868f696a7de51bb016f950871dd1e src/test/python/apache/aurora/common/BUILD 14165b96be99b8de418f4bb8def9f27eaf29e67d src/test/python/apache/aurora/common/test_clusters.py 45250e609cca1149dc296b2aaf645ff2f58f8288 Diff: https://reviews.apache.org/r/30187/diff/ Testing --- ./build-support/jenkins/build.sh test_end_to_end.sh is currently broken on master, i will address that and ensure it passes before committing this. Thanks, Bill Farner
Re: Review Request 30187: Remove support for cluster metadata in YAML format.
On Jan. 26, 2015, 8:40 p.m., Brian Wickman wrote: src/main/python/apache/aurora/common/clusters.py, line 42 https://reviews.apache.org/r/30187/diff/1/?file=830286#file830286line42 wait -- the # noqa is necessary, otherwise checkstyle will fail. Reverted - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/#review69668 --- On Jan. 22, 2015, 9:09 p.m., Bill Farner wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30187/ --- (Updated Jan. 22, 2015, 9:09 p.m.) Review request for Aurora, Brian Wickman and Zameer Manji. Bugs: AURORA-1029 https://issues.apache.org/jira/browse/AURORA-1029 Repository: aurora Description --- Remove support for cluster metadata in YAML format. Diffs - src/main/python/apache/aurora/common/clusters.py e55aa774b4b868f696a7de51bb016f950871dd1e src/test/python/apache/aurora/common/BUILD 14165b96be99b8de418f4bb8def9f27eaf29e67d src/test/python/apache/aurora/common/test_clusters.py 45250e609cca1149dc296b2aaf645ff2f58f8288 Diff: https://reviews.apache.org/r/30187/diff/ Testing --- ./build-support/jenkins/build.sh test_end_to_end.sh is currently broken on master, i will address that and ensure it passes before committing this. Thanks, Bill Farner
Re: Review Request 30384: Added TellApart to list of aurora users.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/#review70107 --- Ship it! README.md https://reviews.apache.org/r/30384/#comment115092 http One of these things is not like the others :-) - Bill Farner On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/ --- (Updated Jan. 28, 2015, 11:03 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- Added TellApart to list of aurora users. Diffs - README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab Diff: https://reviews.apache.org/r/30384/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30384: Added TellApart to list of aurora users.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/#review70108 --- Ship it! Master (51f0d8b) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/ --- (Updated Jan. 28, 2015, 11:03 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- Added TellApart to list of aurora users. Diffs - README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab Diff: https://reviews.apache.org/r/30384/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30384: Added TellApart to list of aurora users.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/ --- (Updated Jan. 28, 2015, 11:38 p.m.) Review request for Aurora and Bill Farner. Changes --- So many more Ss. Repository: aurora Description --- Added TellApart to list of aurora users. Diffs (updated) - README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab Diff: https://reviews.apache.org/r/30384/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30383: Fix small typo in documentation.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30383/#review70106 --- Ship it! Ship It! - Zameer Manji On Jan. 28, 2015, 3:20 p.m., Florian Pfeiffer wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30383/ --- (Updated Jan. 28, 2015, 3:20 p.m.) Review request for Aurora and Zameer Manji. Repository: aurora Description --- Fix small typo in documentation. In the examples it's always db_team, so I think it should be db_team in this sentence as well Diffs - docs/deploying-aurora-scheduler.md c14f865a844b6eb34cc0e9363dba50d4975f5632 Diff: https://reviews.apache.org/r/30383/diff/ Testing --- none Thanks, Florian Pfeiffer
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md How do we show to the user (via client output or UI) that the update is currently blocked? - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff:
Re: Review Request 30384: Added TellApart to list of aurora users.
On Jan. 28, 2015, 11:31 p.m., Bill Farner wrote: README.md, line 65 https://reviews.apache.org/r/30384/diff/1/?file=839312#file839312line65 http One of these things is not like the others :-) Our https site is just a 301 to http. I can keep it https if you want though. :) - Steve --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/#review70107 --- On Jan. 28, 2015, 11:03 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/ --- (Updated Jan. 28, 2015, 11:03 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- Added TellApart to list of aurora users. Diffs - README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab Diff: https://reviews.apache.org/r/30384/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 29286: Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29286/#review70132 --- What specifically is taking place here? (I think I understand some of it, but I'm not sure what was broken or what is being improved.) - Brian Wickman On Dec. 20, 2014, 3:05 a.m., Kevin Sweeney wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29286/ --- (Updated Dec. 20, 2014, 3:05 a.m.) Review request for Aurora and Brian Wickman. Repository: aurora Description --- Fix BUILD files in apache.aurora.admin and apache.aurora.client.api. Diffs - src/main/python/apache/aurora/admin/BUILD f874264bdf07a9cbb2f0990739be3c95f851b040 src/main/python/apache/aurora/admin/host_maintenance.py 8fa5182fae509493def7175635fbe1cb79fa3eec src/main/python/apache/aurora/client/api/BUILD 65e5a85e23c4c698356c8b45c45943e560c1bcd5 src/main/python/apache/aurora/client/api/command_runner.py 48cb567c2098620e0ee322fe9528e167ce7c7c62 src/main/python/apache/aurora/client/api/disambiguator.py 6a78ccd44533ef327f751a08c9e2e16555354d97 src/test/python/apache/aurora/admin/BUILD 3a216809d1e31247f7d01451fcc7fd877a4c1fb2 src/test/python/apache/aurora/client/api/BUILD 2c0c4070cc1f1784b1d4e7f9cd8aac236e97be75 src/test/python/apache/aurora/client/api/test_task_util.py 048aff6874259810efea463df1ca2a1fdc419ca1 Diff: https://reviews.apache.org/r/29286/diff/ Testing --- ./pants src/test/python:all Thanks, Kevin Sweeney
Re: Review Request 29286: Fix BUILD files in apache.aurora.admin and apache.aurora.client.api.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29286/#review70093 --- Is this patch still relevant? - Bill Farner On Dec. 20, 2014, 3:05 a.m., Kevin Sweeney wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29286/ --- (Updated Dec. 20, 2014, 3:05 a.m.) Review request for Aurora and Brian Wickman. Repository: aurora Description --- Fix BUILD files in apache.aurora.admin and apache.aurora.client.api. Diffs - src/main/python/apache/aurora/admin/BUILD f874264bdf07a9cbb2f0990739be3c95f851b040 src/main/python/apache/aurora/admin/host_maintenance.py 8fa5182fae509493def7175635fbe1cb79fa3eec src/main/python/apache/aurora/client/api/BUILD 65e5a85e23c4c698356c8b45c45943e560c1bcd5 src/main/python/apache/aurora/client/api/command_runner.py 48cb567c2098620e0ee322fe9528e167ce7c7c62 src/main/python/apache/aurora/client/api/disambiguator.py 6a78ccd44533ef327f751a08c9e2e16555354d97 src/test/python/apache/aurora/admin/BUILD 3a216809d1e31247f7d01451fcc7fd877a4c1fb2 src/test/python/apache/aurora/client/api/BUILD 2c0c4070cc1f1784b1d4e7f9cd8aac236e97be75 src/test/python/apache/aurora/client/api/test_task_util.py 048aff6874259810efea463df1ca2a1fdc419ca1 Diff: https://reviews.apache.org/r/29286/diff/ Testing --- ./pants src/test/python:all Thanks, Kevin Sweeney
Re: Review Request 28731: Implemented TaskScheduler benchmarks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28731/#review70095 --- Ship it! LGTM. I suspect we will learn more about how this scaffolding should look as we start submitting perf improvement patches with test cases here. - Bill Farner On Jan. 22, 2015, 1:54 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28731/ --- (Updated Jan. 22, 2015, 1:54 a.m.) Review request for Aurora, Kevin Sweeney and Bill Farner. Repository: aurora Description --- Added baseline benchmarks for a few static veto cases. Diffs - build.gradle f9f71a84493b782e9f6072e44e89a2c017cf2a09 src/jmh/java/org/apache/aurora/benchmark/Hosts.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/Offers.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/SchedulerBenchmark.java 5cecada93e4e04b689e826af49f691ed7e94ae49 src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/Tasks.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/fakes/FakeRescheduleCalculator.java PRE-CREATION src/jmh/java/org/apache/aurora/benchmark/fakes/FakeStatsProvider.java PRE-CREATION src/main/java/org/apache/aurora/scheduler/async/TaskScheduler.java b6402ae42e3c7e4dca1c120fa6ef82d2d69e69d5 src/main/java/org/apache/aurora/scheduler/async/preemptor/CachedClusterState.java 03c2a8fde4a3fe5ac73f44da2cbe227995501c46 src/main/java/org/apache/aurora/scheduler/async/preemptor/ClusterState.java f7e157c890b5627411acd4bd5c2559ef4829147c src/main/java/org/apache/aurora/scheduler/async/preemptor/PreemptorImpl.java 0204d14b19ae412236f19ca274d81decb4eba12d Diff: https://reviews.apache.org/r/28731/diff/ Testing --- Sample run on a local box: ``` # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/jre/bin/java # VM options: -Dfile.encoding=UTF-8 -Duser.country=US -Duser.language=en -Duser.variant # Warmup: 10 iterations, 1 s each # Measurement: 100 iterations, 1 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Average time, time/op # Benchmark: org.apache.aurora.benchmark.SchedulingBenchmarks.ConstraintMismatchsSchedulingBenchmark.runBenchmark # Run progress: 0.00% complete, ETA 00:05:30 # Fork: 1 of 1 # Warmup Iteration 1: 278284250.000 ns/op # Warmup Iteration 2: 70294312.500 ns/op # Warmup Iteration 3: 19293379.310 ns/op # Warmup Iteration 4: 11945387.097 ns/op # Warmup Iteration 5: 10725388.350 ns/op # Warmup Iteration 6: 13043282.353 ns/op # Warmup Iteration 7: 9233083.333 ns/op # Warmup Iteration 8: 9521051.724 ns/op # Warmup Iteration 9: 10750854.369 ns/op # Warmup Iteration 10: 7460243.243 ns/op Iteration 1: 7885364.286 ns/op Iteration 2: 7735139.860 ns/op Iteration 3: 7660208.333 ns/op Iteration 4: 7761204.225 ns/op Iteration 5: 7868907.143 ns/op Iteration 6: 7567404.110 ns/op Iteration 7: 7611000.000 ns/op Iteration 8: 7766154.930 ns/op Iteration 9: 7669344.828 ns/op Iteration 10: 7707783.217 ns/op Iteration 11: 7435651.007 ns/op Iteration 12: 7697631.944 ns/op Iteration 13: 7712531.469 ns/op Iteration 14: 7899407.143 ns/op Iteration 15: 7448472.973 ns/op Iteration 16: 7791521.127 ns/op Iteration 17: 7612213.793 ns/op Iteration 18: 7710867.133 ns/op Iteration 19: 7649296.552 ns/op Iteration 20: 7768309.859 ns/op Iteration 21: 7688666.667 ns/op Iteration 22: 7531557.823 ns/op Iteration 23: 7381193.333 ns/op Iteration 24: 7726006.993 ns/op Iteration 25: 7603358.621 ns/op Iteration 26: 7653631.944 ns/op Iteration 27: 7442275.168 ns/op Iteration 28: 7613186.207 ns/op Iteration 29: 7765823.944 ns/op Iteration 30: 7489687.075 ns/op Iteration 31: 7811443.662 ns/op Iteration 32: 8015007.246 ns/op Iteration 33: 8192392.593 ns/op Iteration 34: 8040335.766 ns/op Iteration 35: 7584212.329 ns/op Iteration 36: 8001934.783 ns/op Iteration 37: 9744815.789 ns/op Iteration 38: 11688284.211 ns/op Iteration 39: 8661406.250 ns/op Iteration 40: 7678413.793 ns/op Iteration 41: 8502223.077 ns/op Iteration 42: 7640820.690 ns/op Iteration 43: 7875624.113 ns/op Iteration 44: 7506809.524 ns/op Iteration 45: 8005431.655 ns/op Iteration 46: 8081664.234 ns/op Iteration 47: 7579438.356 ns/op Iteration 48: 7993405.797 ns/op Iteration 49: 7571958.904 ns/op Iteration 50: 8116463.235 ns/op Iteration 51: 7941330.935 ns/op Iteration 52:
Review Request 30384: Added TellApart to list of aurora users.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30384/ --- Review request for Aurora and Bill Farner. Repository: aurora Description --- Added TellApart to list of aurora users. Diffs - README.md 61e0253b1f7a3958c1e8444a071d8364b4164fab Diff: https://reviews.apache.org/r/30384/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. David McLaughlin wrote: I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. Maxim Khutornenko wrote: The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser On that email thread, my last comment is to bring up this exact point :) I'm -1 on proceeding without auditing for blocked state. But if the majority disagrees, then that's fine and I'm fine for this to proceed with another ship it. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. David McLaughlin wrote: I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs -
Re: Review Request 30207: Simplify AuroraCommandContext
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30207/#review70043 --- Bill, can I get a ship it here? - Zameer Manji On Jan. 22, 2015, 7:32 p.m., Zameer Manji wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30207/ --- (Updated Jan. 22, 2015, 7:32 p.m.) Review request for Aurora, Maxim Khutornenko and Bill Farner. Repository: aurora Description --- The AuroraCommandContext class is used in multiple commands and contains common code for all of them. However some portions are only used by one command. This patch takes some of those portions and moves them to the command that requires that functionality. Diffs - src/main/python/apache/aurora/client/cli/context.py 51c7d24dca664e476e62f1864d095416dfab70e4 src/main/python/apache/aurora/client/cli/jobs.py 8f349c09637c16e2499e85f2dc96eb7ccffd0aaf src/main/python/apache/aurora/client/cli/update.py PRE-CREATION src/test/python/apache/aurora/client/cli/test_supdate.py PRE-CREATION src/test/python/apache/aurora/client/cli/test_update.py 8b7d11202b35deb09a248cfe0a96458fb70c Diff: https://reviews.apache.org/r/30207/diff/ Testing --- ./pants test.pytest --no-fast src/test/python/apache/aurora/client:: Thanks, Zameer Manji
Re: Review Request 30207: Simplify AuroraCommandContext
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30207/#review70046 --- Ship it! Ship It! - Bill Farner On Jan. 23, 2015, 3:32 a.m., Zameer Manji wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30207/ --- (Updated Jan. 23, 2015, 3:32 a.m.) Review request for Aurora, Maxim Khutornenko and Bill Farner. Repository: aurora Description --- The AuroraCommandContext class is used in multiple commands and contains common code for all of them. However some portions are only used by one command. This patch takes some of those portions and moves them to the command that requires that functionality. Diffs - src/main/python/apache/aurora/client/cli/context.py 51c7d24dca664e476e62f1864d095416dfab70e4 src/main/python/apache/aurora/client/cli/jobs.py 8f349c09637c16e2499e85f2dc96eb7ccffd0aaf src/main/python/apache/aurora/client/cli/update.py PRE-CREATION src/test/python/apache/aurora/client/cli/test_supdate.py PRE-CREATION src/test/python/apache/aurora/client/cli/test_update.py 8b7d11202b35deb09a248cfe0a96458fb70c Diff: https://reviews.apache.org/r/30207/diff/ Testing --- ./pants test.pytest --no-fast src/test/python/apache/aurora/client:: Thanks, Zameer Manji
Re: Review Request 29827: End to end tests for docker in aurora
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/ --- (Updated Jan. 28, 2015, 6:56 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- This adds an end to end test for docker. Diffs (updated) - src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 45da754341de52759d05a8960a9a978111f1e415 Diff: https://reviews.apache.org/r/29827/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 29827: End to end tests for docker in aurora
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/#review70055 --- Ship it! Master (46a0466) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Jan. 28, 2015, 6:56 p.m., Steve Niemitz wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29827/ --- (Updated Jan. 28, 2015, 6:56 p.m.) Review request for Aurora and Bill Farner. Repository: aurora Description --- This adds an end to end test for docker. Diffs - src/test/sh/org/apache/aurora/e2e/Dockerfile PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/http/http_example_docker_updated.aurora PRE-CREATION src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 45da754341de52759d05a8960a9a978111f1e415 Diff: https://reviews.apache.org/r/29827/diff/ Testing --- Thanks, Steve Niemitz
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 27, 2015, 5:49 p.m., Joshua Cohen wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 227 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line227 Given how sensitive we are to storage lock contention, is there any work in this block that can be done outside of the lock? From a casual glance the only potentially time-consuming (non-write) operation is the fetch from the job update store, but I'm not sure how intensive an operation that will be (or the impact of performing it outside of the context of a lock). Maxim Khutornenko wrote: Moving a fetch call out of the write transaction would create a potential for a race that may compromise the feature and create hard to reason/trace corner cases. This is not worth the arguable gain in contention improvement in my opinion. David McLaughlin wrote: AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? Also, I don't see the value in persisting the heartbeats. What is the rationale for this? Maxim Khutornenko wrote: AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? - anything inside the evaluation logic is sensitive to the data provided in the details pull. This ranges from an invalid OK request sent back for a completed update to an internal updater failure due to outdated expectations. This pattern of querying inside of a write transaction is common in the updater (e.g. look at changeJobUpdateStatus() called any time an update is evaluated) and I'd rather not second-guess the correctness of the split. Also, I don't see the value in persisting the heartbeats. What is the rationale for this? - what makes you think they are persisted? The heartbeats are stored in-memory only (look for coordinatedUpdateStates map). Thanks for explaining, the presence of write lock threw me off and I misread the code. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69834 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment115014 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. - David McLaughlin On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko