Re: Review Request 41226: Handling task event race in updater.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/#review110696 --- Ship it! Master (fb8155d) is green with this patch. true I will refresh this build result if you post a review containing "@ReviewBot retry" - John Sirois On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/41226/ > --- > > (Updated Dec. 11, 2015, 1:33 a.m.) > > > Review request for Aurora and Bill Farner. > > > Bugs: AURORA-1506 and AURORA-1507 > https://issues.apache.org/jira/browse/AURORA-1506 > https://issues.apache.org/jira/browse/AURORA-1507 > > > Repository: aurora > > > Description > --- > > Task event processing delays may generate duplicate evaluation results in the > `InstanceActionHandler`. This leads to occasional stack traces in the log but > is generally benign as `JobUpdateEventSubscriber` catches these errors. > > The easiest fix is to check for the task existence before attempting to > kill/add instance. > > > Diffs > - > > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > d8686f1f1b53e5ff2791663489e24c342503831e > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 0583a63a175880355a7296ebd1c6e6fb5dc99f38 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/41226/diff/ > > > Testing > --- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
Re: Review Request 41226: Handling task event race in updater.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/#review110614 --- Ping. - Maxim Khutornenko On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/41226/ > --- > > (Updated Dec. 11, 2015, 1:33 a.m.) > > > Review request for Aurora and Bill Farner. > > > Bugs: AURORA-1506 and AURORA-1507 > https://issues.apache.org/jira/browse/AURORA-1506 > https://issues.apache.org/jira/browse/AURORA-1507 > > > Repository: aurora > > > Description > --- > > Task event processing delays may generate duplicate evaluation results in the > `InstanceActionHandler`. This leads to occasional stack traces in the log but > is generally benign as `JobUpdateEventSubscriber` catches these errors. > > The easiest fix is to check for the task existence before attempting to > kill/add instance. > > > Diffs > - > > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > d8686f1f1b53e5ff2791663489e24c342503831e > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 0583a63a175880355a7296ebd1c6e6fb5dc99f38 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/41226/diff/ > > > Testing > --- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
Re: Review Request 41226: Handling task event race in updater.
> On Dec. 11, 2015, 7:07 p.m., Bill Farner wrote: > > Can you take a stab at illustrating in the description how this problem > > manifests? The current shape of the patch feels like it papers over a > > problem that we might be left to solve again, so thinking about the problem > > at a high level might help us think about a broader fix. I agree it may feel somewhat hacky but it's the easiest way to handle updater/event race without adding more moving parts or excessive locking. The core of the problem is (as mentioned in the description) in the way we process task events. After moving into async processing, events can be (and are) delayed. It's a hard problem to follow but I'll try to explain with an example of AURORA-1506: - instance 1 fails while been watched by the updater, event F is dispatched (e.g.: RUNNING->FAILED) - instance 1 is rescheduled and event R is dispatched (e.g.: INIT->THROTTLED) - event F reached updater and updater decides to rollback, issues a kill for the instance intending to return to previous config - THROTTLED task is deleted from the store without firing a task change event (but firing a task deleted event) - event R is finally processed by async bus and updater is notified. Updater follows the evaluate path and "thinks" a KILL is still required, which is legit but does not make sense as the task is no longer in the store. So, the crux of the problem is entering updater evaluator loop with a stale task info. Figuring out if the task info is stale at the subscriber point is impossible as it requires updater inner state. The best, as it seems to me, let the updater figure out what to do but prevent the state changing action if it conflicts with the store. Sort of optimistic evaluation run. Now to the question: The only downside of this race is a stack trace in the log. The actual errors are handled by the `JobUpdateEventSubscriber` and do not fail the update. These errors are polluting our logs and monitoring stats. - Maxim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/#review109984 --- On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/41226/ > --- > > (Updated Dec. 11, 2015, 1:33 a.m.) > > > Review request for Aurora and Bill Farner. > > > Bugs: AURORA-1506 and AURORA-1507 > https://issues.apache.org/jira/browse/AURORA-1506 > https://issues.apache.org/jira/browse/AURORA-1507 > > > Repository: aurora > > > Description > --- > > Task event processing delays may generate duplicate evaluation results in the > `InstanceActionHandler`. This leads to occasional stack traces in the log but > is generally benign as `JobUpdateEventSubscriber` catches these errors. > > The easiest fix is to check for the task existence before attempting to > kill/add instance. > > > Diffs > - > > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > d8686f1f1b53e5ff2791663489e24c342503831e > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 0583a63a175880355a7296ebd1c6e6fb5dc99f38 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/41226/diff/ > > > Testing > --- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
Re: Review Request 41226: Handling task event race in updater.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/#review109984 --- Can you take a stab at illustrating in the description how this problem manifests? The current shape of the patch feels like it papers over a problem that we might be left to solve again, so thinking about the problem at a high level might help us think about a broader fix. - Bill Farner On Dec. 10, 2015, 5:33 p.m., Maxim Khutornenko wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/41226/ > --- > > (Updated Dec. 10, 2015, 5:33 p.m.) > > > Review request for Aurora and Bill Farner. > > > Bugs: AURORA-1506 and AURORA-1507 > https://issues.apache.org/jira/browse/AURORA-1506 > https://issues.apache.org/jira/browse/AURORA-1507 > > > Repository: aurora > > > Description > --- > > Task event processing delays may generate duplicate evaluation results in the > `InstanceActionHandler`. This leads to occasional stack traces in the log but > is generally benign as `JobUpdateEventSubscriber` catches these errors. > > The easiest fix is to check for the task existence before attempting to > kill/add instance. > > > Diffs > - > > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > d8686f1f1b53e5ff2791663489e24c342503831e > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 0583a63a175880355a7296ebd1c6e6fb5dc99f38 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/41226/diff/ > > > Testing > --- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
Re: Review Request 41226: Handling task event race in updater.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/ --- (Updated Dec. 11, 2015, 1:33 a.m.) Review request for Aurora and Bill Farner. Changes --- Realized AURORA-1507 has the same root cause. Bundling it into this diff. Summary (updated) - Handling task event race in updater. Bugs: AURORA-1506 and AURORA-1507 https://issues.apache.org/jira/browse/AURORA-1506 https://issues.apache.org/jira/browse/AURORA-1507 Repository: aurora Description (updated) --- Task event processing delays may generate duplicate evaluation results in the `InstanceActionHandler`. This leads to occasional stack traces in the log but is generally benign as `JobUpdateEventSubscriber` catches these errors. The easiest fix is to check for the task existence before attempting to kill/add instance. Diffs (updated) - src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java d8686f1f1b53e5ff2791663489e24c342503831e src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 0583a63a175880355a7296ebd1c6e6fb5dc99f38 src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java PRE-CREATION Diff: https://reviews.apache.org/r/41226/diff/ Testing --- ./gradlew -Pq build Thanks, Maxim Khutornenko
Re: Review Request 41226: Handling task event race in updater.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/41226/#review109906 --- Ship it! Master (6933a71) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing "@ReviewBot retry" - Aurora ReviewBot On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/41226/ > --- > > (Updated Dec. 11, 2015, 1:33 a.m.) > > > Review request for Aurora and Bill Farner. > > > Bugs: AURORA-1506 and AURORA-1507 > https://issues.apache.org/jira/browse/AURORA-1506 > https://issues.apache.org/jira/browse/AURORA-1507 > > > Repository: aurora > > > Description > --- > > Task event processing delays may generate duplicate evaluation results in the > `InstanceActionHandler`. This leads to occasional stack traces in the log but > is generally benign as `JobUpdateEventSubscriber` catches these errors. > > The easiest fix is to check for the task existence before attempting to > kill/add instance. > > > Diffs > - > > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > d8686f1f1b53e5ff2791663489e24c342503831e > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 0583a63a175880355a7296ebd1c6e6fb5dc99f38 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/41226/diff/ > > > Testing > --- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >