Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 282-292 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line282 There's a race on access to `pulseStates`. Terminating an update between :282 and :287 would cause a stale entry in the map (`storage.read` doesn't offer a lock, but we shouldn't count on that anyway). Consider `synchronized (pulseStates)`. IMHO, though, this points out that this class is difficult to reason about now that multiple disparate locks are in play. Did you consider extracting an inner class to use? This would be different from the fully-isolated class we discussed before, where you can share implementation details and accept high coupling, but at least isolate management of `pulseStates` and manage it with higher-level operations from the rest of the class. Seems like you've already done this to a degree by keeping methods managing this map near each other. Your call on this. The read lock is no more but old habits die hard. You're right, it's no longer required after removing all DB access from this method. Dropped the storage code and consolidated all map access in a private class. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 277 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line277 `storeProvider` is not used, do you need the `storage.read` at all? Dropped. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java, line 166 https://reviews.apache.org/r/30225/diff/6/?file=853464#file853464line166 s/query/fetch/ That's already taken. Renamed to job_update_store_fetch_details_list. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 123 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line123 Please document this map more thoroughly. How are entries added here, how are they removed, etc. For example, one useful detail is that we retain an entry for an update that is paused. Done. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 124 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line124 Please include in test coverage some validation of the contents of this map so we can gain confidence that we do not have a memory leak. It was already tested by the following assertion: `assertEquals(JobUpdatePulseStatus.FINISHED, updater.pulse(UPDATE_ID));`. This would never return FINISHED if there was an entry in the map. Added to all pulse tests to make sure all scenarios are validated. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 305 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line305 Coverage report indicates this line is not covered. Can you try to cover it? Done. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 488 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line488 Shouldn't this be in an `else`? Looks like we've just wiped a pulse record for terminal updates, but we add another here. It's updated conditionally if PulseState exists. Added else to make it more clear. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 545 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line545 This only has one caller, consider inlining. Yeah, there is only one caller now after refactoring. Merged. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 547 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line547 Coverage report indicates an uncovered branch here. Can you try to cover it? This one is tricky as it's yellow due to `pulse == null`. Given the trivial nature of this branch and complexity of covering it, I'd like to punt it. On Feb. 11, 2015, 1:03 a.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 721 https://reviews.apache.org/r/30225/diff/6/?file=853467#file853467line721 At risk of being overly-verbose, how about `initializePulseMonitor`? A quick skim of `initializePulse` makes it sound like it might be simulating an inital pulse. How about `initializePulseState`? This seems to fit
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 7:19 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Changes --- Bill's comments. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs (updated) - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72014 --- Ship it! src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117942 can drop the else? - Joshua Cohen On Feb. 11, 2015, 7:19 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 7:19 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72022 --- Ship it! Master (7b531e9) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 11, 2015, 7:19 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 7:19 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72016 --- Ship it! Ship It! - Bill Farner On Feb. 11, 2015, 7:19 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 7:19 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 10:01 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Changes --- Josh's comments. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs (updated) - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 11, 2015, 8:46 p.m., Joshua Cohen wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 276 https://reviews.apache.org/r/30225/diff/7/?file=860914#file860914line276 can drop the else? Sure. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72014 --- On Feb. 11, 2015, 7:19 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 7:19 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72028 --- Ship it! Ship It! - Joshua Cohen On Feb. 11, 2015, 10:01 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 10:01 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review72041 --- Ship it! Master (7b531e9) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 11, 2015, 10:01 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 11, 2015, 10:01 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71879 --- I like the shape of this, thanks for working through a few different paths. src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java https://reviews.apache.org/r/30225/#comment117774 s/query/fetch/ src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117776 Please document this map more thoroughly. How are entries added here, how are they removed, etc. For example, one useful detail is that we retain an entry for an update that is paused. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117801 Please include in test coverage some validation of the contents of this map so we can gain confidence that we do not have a memory leak. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117779 `storeProvider` is not used, do you need the `storage.read` at all? src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117778 There's a race on access to `pulseStates`. Terminating an update between :282 and :287 would cause a stale entry in the map (`storage.read` doesn't offer a lock, but we shouldn't count on that anyway). Consider `synchronized (pulseStates)`. IMHO, though, this points out that this class is difficult to reason about now that multiple disparate locks are in play. Did you consider extracting an inner class to use? This would be different from the fully-isolated class we discussed before, where you can share implementation details and accept high coupling, but at least isolate management of `pulseStates` and manage it with higher-level operations from the rest of the class. Seems like you've already done this to a degree by keeping methods managing this map near each other. Your call on this. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117821 Coverage report indicates this line is not covered. Can you try to cover it? src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117780 Shouldn't this be in an `else`? Looks like we've just wiped a pulse record for terminal updates, but we add another here. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117786 This only has one caller, consider inlining. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117819 Coverage report indicates an uncovered branch here. Can you try to cover it? src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117775 At risk of being overly-verbose, how about `initializePulseMonitor`? A quick skim of `initializePulse` makes it sound like it might be simulating an inital pulse. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment117781 Your call, but you could massage out the `clock` if you store `expirationMs` and have `getExpirationMs()` instead of storing `lastPulseMs`. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java https://reviews.apache.org/r/30225/#comment117791 This is never used as a `Function`, strongly consider converting to a method. Ditto elsewhere if it applies. src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java https://reviews.apache.org/r/30225/#comment117803 I think you've done the right thing, but note that this unit test is inconsistent with your reply on review https://reviews.apache.org/r/30804/ - Bill Farner On Feb. 7, 2015, 2:43 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 7, 2015, 2:43 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71729 --- Ship it! Pulse logic LGTM. - David McLaughlin On Feb. 7, 2015, 2:43 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 7, 2015, 2:43 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71545 --- Ship it! Master (11a65d2) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 7, 2015, 2:43 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 7, 2015, 2:43 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 7, 2015, 2:43 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description (updated) --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). UPDATE: - Added explicit states to capture blocked updates - Refactored pulseUpdate() to not rely on DB state - Dropped JobUpdatePulseState.PAUSED as it does not seem necessary. Diffs (updated) - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 5, 2015, 6:07 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 308 https://reviews.apache.org/r/30225/diff/5/?file=849498#file849498line308 Deferrment is not quite what i had in mind. I was thinking something more along the lines of how `TaskTimeout` works, where receiving a pulse causes you to schedule an action in the future that will change the update state (provided no more pulses are received). If we go this route, i think a separate class should be used to manage heartbeats, and this presents a logical place to break this diff apart. As the code stands, we still have the problem where a scheduler with idle inputs will not transition the update state. As the code stands, we still have the problem where a scheduler with idle inputs will not transition the update state. I still don't see a reason to overcomplicate the logic here. Why move the update into a new state when there is no internal activity and why should it be any different that instance-driven transition events? - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71245 --- On Feb. 5, 2015, 2:34 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 5, 2015, 2:34 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71245 --- src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java https://reviews.apache.org/r/30225/#comment116915 s/query/fetch/ src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment116917 Deferrment is not quite what i had in mind. I was thinking something more along the lines of how `TaskTimeout` works, where receiving a pulse causes you to schedule an action in the future that will change the update state (provided no more pulses are received). If we go this route, i think a separate class should be used to manage heartbeats, and this presents a logical place to break this diff apart. As the code stands, we still have the problem where a scheduler with idle inputs will not transition the update state. - Bill Farner On Feb. 5, 2015, 2:34 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 5, 2015, 2:34 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70992 --- Ship it! Master (8bcb2ba) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Changes --- Fixing method name escaped during rebase. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs (updated) - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 --- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 --- I suggest you skip to the big comment before paying attention to the smaller ones. If you agree with the concern, i suggest an initial focused diff that implements receiving heartbeats, and asynchronously reacting to the lack of their presence. api/src/main/thrift/org/apache/aurora/gen/api.thrift https://reviews.apache.org/r/30225/#comment116606 Maybe s/BLOCKED/AWAITING_PULSE/? That would at least self-document and avoid additional terminology. src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java https://reviews.apache.org/r/30225/#comment116607 s/query/fetch/ to be consistent with the method above. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment116619 Remove Optional here, do Optional.of() at the call site. src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment116620 TODO + ticket, this map needs to be exposed via a debug endpoint src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment116626 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml https://reviews.apache.org/r/30225/#comment116643 is it possible to reduce the diff a bit by moving this block to its former location, or is it needed - Bill Farner On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 --- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. Bill Farner wrote: I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. I don't think this would buy us much if external service starts following up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume. If we are concerned about obuse I think having a rate
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. Bill Farner wrote: I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. Maxim Khutornenko wrote: Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. I don't think this would buy us much if external service starts following up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume. If we are
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. Bill Farner wrote: I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. Maxim Khutornenko wrote: Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. I don't think this would buy us much if external service starts following up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume. If we are
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71114 --- src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment116725 I think we still want to update the last pulse time even if it's paused? - David McLaughlin On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71143 --- Ship it! Master (edcc252) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 5, 2015, 2:34 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 5, 2015, 2:34 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. Bill Farner wrote: I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. Maxim Khutornenko wrote: Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. I don't think this would buy us much if external service starts following up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume. If we are
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: api/src/main/thrift/org/apache/aurora/gen/api.thrift, line 565 https://reviews.apache.org/r/30225/diff/4/?file=848240#file848240line565 Maybe s/BLOCKED/AWAITING_PULSE/? That would at least self-document and avoid additional terminology. Renamed. On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java, line 49 https://reviews.apache.org/r/30225/diff/4/?file=848242#file848242line49 s/query/fetch/ to be consistent with the method above. I started with fetch but that resulted in an overloaded method that I did not quite liked. Changed it back. On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 106 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line106 Remove Optional here, do Optional.of() at the call site. Removed. On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 122 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line122 TODO + ticket, this map needs to be exposed via a debug endpoint Done. On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273 I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. Maxim Khutornenko wrote: I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. - makes sense, I can try to split the recording part from the status update via an async action. Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. Bill Farner wrote: I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Feb. 5, 2015, 1:15 a.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 293 https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line293 I think we still want to update the last pulse time even if it's paused? Makes sense. A paused update is still technically active. Changed to support PAUSED pulse. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71114 --- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 5:24 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 5, 2015, 2:34 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Changes --- Bill's and David's feedback. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs (updated) - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 92cfa2b30c1c4daa3ae225fc8609fbeaecdaff7a src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70913 --- Master (8bcb2ba) is red with this patch. ./build-support/jenkins/build.sh make[4]: Entering directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1/tutorial' ../compiler/cpp/thrift --gen html -r ../tutorial/tutorial.thrift make[4]: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1/tutorial' make[3]: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1/tutorial' make[3]: Entering directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1' make[3]: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1' make[2]: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1' make[1]: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift/thrift-0.9.1' make: Leaving directory `/home/jenkins/jenkins-slave/workspace/AuroraBot/build-support/thrift' :api:classesThriftNote: Some input files use unchecked or unsafe operations. Note: Recompile with -Xlint:unchecked for details. :api:checkPython :api:generateThriftEntitiesJava :api:classesThriftEntities :api:compileJava UP-TO-DATE :api:generateThriftResources :api:processResources UP-TO-DATE :api:classes :api:jar :compileJavaNote: Writing file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/com/twitter/common/args/apt/cmdline.arg.info.txt.2 /home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java:166: error: cannot find symbol storeProvider.getJobUpdateStore().fetchJobUpdateSummaries(queryByJob(job)); ^ symbol: method queryByJob(IJobKey) 1 error FAILED FAILURE: Build failed with an exception. * What went wrong: Execution failed for task ':compileJava'. Compilation failed; see the compiler error output for details. * Try: Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. BUILD FAILED Total time: 1 mins 19.808 secs I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 4, 2015, 2:29 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 2:29 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70904 --- This patch does not apply cleanly on master (9fe6d54), do you need to rebase? I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 4, 2015, 2:03 a.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Feb. 4, 2015, 2:03 a.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. David McLaughlin wrote: I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. Maxim Khutornenko wrote: The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser David McLaughlin wrote: On that email thread, my last comment is to bring up this exact point :) I'm -1 on proceeding without auditing for blocked state. But if the majority disagrees, then that's fine and I'm fine for this to proceed with another ship it. Bill Farner wrote: David - I do think you raise a valid point, and we at least need to determine whether this is a reasonable tradeoff. Can you start a thread on dev@ to try to reach consensus? Done. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70076 --- Ship it! Looks reasonable to me (caveat being I'm not super familiar with the internals of the scheduler driven updates). - Joshua Cohen On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md How do we show to the user (via client output or UI) that the update is currently blocked? - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff:
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. David McLaughlin wrote: I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. Maxim Khutornenko wrote: The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser On that email thread, my last comment is to bring up this exact point :) I'm -1 on proceeding without auditing for blocked state. But if the majority disagrees, then that's fine and I'm fine for this to proceed with another ship it. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. Maxim Khutornenko wrote: An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: - An update can be created blocked by default awaiting for the first pulse to start its progress; - An occasional network partition/delay will not require an explicit external service operation to resume; - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md David McLaughlin wrote: How do we show to the user (via client output or UI) that the update is currently blocked? Maxim Khutornenko wrote: One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 David McLaughlin wrote: I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. Maxim Khutornenko wrote: That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal transient state to troubleshoot a coordinated job unable to make progress. David McLaughlin wrote: I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE - ROLLING_FORWARD. The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the should this be blocked? calculation when no previous pulse exists in the on-heap data structure. The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs -
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 27, 2015, 5:49 p.m., Joshua Cohen wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 227 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line227 Given how sensitive we are to storage lock contention, is there any work in this block that can be done outside of the lock? From a casual glance the only potentially time-consuming (non-write) operation is the fetch from the job update store, but I'm not sure how intensive an operation that will be (or the impact of performing it outside of the context of a lock). Maxim Khutornenko wrote: Moving a fetch call out of the write transaction would create a potential for a race that may compromise the feature and create hard to reason/trace corner cases. This is not worth the arguable gain in contention improvement in my opinion. David McLaughlin wrote: AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? Also, I don't see the value in persisting the heartbeats. What is the rationale for this? Maxim Khutornenko wrote: AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? - anything inside the evaluation logic is sensitive to the data provided in the details pull. This ranges from an invalid OK request sent back for a completed update to an internal updater failure due to outdated expectations. This pattern of querying inside of a write transaction is common in the updater (e.g. look at changeJobUpdateStatus() called any time an update is evaluated) and I'd rather not second-guess the correctness of the split. Also, I don't see the value in persisting the heartbeats. What is the rationale for this? - what makes you think they are persisted? The heartbeats are stored in-memory only (look for coordinatedUpdateStates map). Thanks for explaining, the presence of write lock threw me off and I misread the code. - David --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69834 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 --- src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment115014 I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. So surely the last pulse time would be checked externally to the method that performs the pulse? If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. - David McLaughlin On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
On Jan. 27, 2015, 5:49 p.m., Joshua Cohen wrote: src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 227 https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line227 Given how sensitive we are to storage lock contention, is there any work in this block that can be done outside of the lock? From a casual glance the only potentially time-consuming (non-write) operation is the fetch from the job update store, but I'm not sure how intensive an operation that will be (or the impact of performing it outside of the context of a lock). Maxim Khutornenko wrote: Moving a fetch call out of the write transaction would create a potential for a race that may compromise the feature and create hard to reason/trace corner cases. This is not worth the arguable gain in contention improvement in my opinion. David McLaughlin wrote: AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? Also, I don't see the value in persisting the heartbeats. What is the rationale for this? AFAICT the worst race condition is sending a heartbeat for a job that is already complete. Are there other race conditions we are protecting against? - anything inside the evaluation logic is sensitive to the data provided in the details pull. This ranges from an invalid OK request sent back for a completed update to an internal updater failure due to outdated expectations. This pattern of querying inside of a write transaction is common in the updater (e.g. look at changeJobUpdateStatus() called any time an update is evaluated) and I'd rather not second-guess the correctness of the split. Also, I don't see the value in persisting the heartbeats. What is the rationale for this? - what makes you think they are persisted? The heartbeats are stored in-memory only (look for coordinatedUpdateStates map). - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69834 --- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69825 --- Ping. - Maxim Khutornenko On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69834 --- src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java https://reviews.apache.org/r/30225/#comment114636 Given how sensitive we are to storage lock contention, is there any work in this block that can be done outside of the lock? From a casual glance the only potentially time-consuming (non-write) operation is the fetch from the job update store, but I'm not sure how intensive an operation that will be (or the impact of performing it outside of the context of a lock). - Joshua Cohen On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. Bugs: AURORA-1010 https://issues.apache.org/jira/browse/AURORA-1010 Repository: aurora Description --- Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. Diffs - src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 Diff: https://reviews.apache.org/r/30225/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 30225: Modifying update controller to support heartbeats.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review69460 --- Master (3fa004b) is red with this patch. ./build-support/jenkins/build.sh src.test.python.apache.aurora.client.cli.config . SUCCESS src.test.python.apache.aurora.client.cli.cron . SUCCESS src.test.python.apache.aurora.client.cli.inspect . SUCCESS src.test.python.apache.aurora.client.cli.job . SUCCESS src.test.python.apache.aurora.client.cli.plugins . SUCCESS src.test.python.apache.aurora.client.cli.quota . SUCCESS src.test.python.apache.aurora.client.cli.sla . SUCCESS src.test.python.apache.aurora.client.cli.supdate . SUCCESS src.test.python.apache.aurora.client.cli.task . SUCCESS src.test.python.apache.aurora.client.cli.update . SUCCESS src.test.python.apache.aurora.client.cli.version . SUCCESS src.test.python.apache.aurora.client.config . SUCCESS src.test.python.apache.aurora.client.hooks.hooked_api . SUCCESS src.test.python.apache.aurora.client.hooks.non_hooked_api . SUCCESS src.test.python.apache.aurora.common.test_aurora_job_key . SUCCESS src.test.python.apache.aurora.common.test_cluster . SUCCESS src.test.python.apache.aurora.common.test_cluster_option . SUCCESS src.test.python.apache.aurora.common.test_clusters . SUCCESS src.test.python.apache.aurora.common.test_http_signaler . SUCCESS src.test.python.apache.aurora.common.test_pex_version . SUCCESS src.test.python.apache.aurora.common.test_shellify . SUCCESS src.test.python.apache.aurora.common.test_transport . SUCCESS src.test.python.apache.aurora.config.test_base . SUCCESS src.test.python.apache.aurora.config.test_constraint_parsing . SUCCESS src.test.python.apache.aurora.config.test_loader . SUCCESS src.test.python.apache.aurora.config.test_thrift . SUCCESS src.test.python.apache.aurora.executor.common.task_info . SUCCESS src.test.python.apache.aurora.executor.executor_base . SUCCESS src.test.python.apache.aurora.executor.executor_detector . SUCCESS src.test.python.apache.aurora.executor.executor_vars . SUCCESS src.test.python.apache.aurora.executor.status_manager . SUCCESS src.test.python.apache.aurora.executor.thermos_task_runner . FAILURE src.test.python.apache.thermos.common.test_pathspec . SUCCESS src.test.python.apache.thermos.core.test_runner_integration . SUCCESS src.test.python.apache.thermos.monitoring.test_disk . SUCCESS FAILURE [31m FAILURE[0m I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/ --- (Updated Jan. 23, 2015, 8:37 p.m.) Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill