Re: Review Request 41226: Handling task event race in updater.

2015-12-16 Thread John Sirois

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review110696
---

Ship it!


Master (fb8155d) is green with this patch.
  true

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- John Sirois


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> ---
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
> https://issues.apache.org/jira/browse/AURORA-1506
> https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> ---
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 41226: Handling task event race in updater.

2015-12-15 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review110614
---


Ping.

- Maxim Khutornenko


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> ---
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
> https://issues.apache.org/jira/browse/AURORA-1506
> https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> ---
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 41226: Handling task event race in updater.

2015-12-11 Thread Maxim Khutornenko


> On Dec. 11, 2015, 7:07 p.m., Bill Farner wrote:
> > Can you take a stab at illustrating in the description how this problem 
> > manifests?  The current shape of the patch feels like it papers over a 
> > problem that we might be left to solve again, so thinking about the problem 
> > at a high level might help us think about a broader fix.

I agree it may feel somewhat hacky but it's the easiest way to handle 
updater/event race without adding more moving parts or excessive locking.

The core of the problem is (as mentioned in the description) in the way we 
process task events. After moving into async processing, events can be (and 
are) delayed. It's a hard problem to follow but I'll try to explain with an 
example of AURORA-1506:
- instance 1 fails while been watched by the updater, event F is dispatched 
(e.g.: RUNNING->FAILED)
- instance 1 is rescheduled and event R is dispatched (e.g.: INIT->THROTTLED)
- event F reached updater and updater decides to rollback, issues a kill for 
the instance intending to return to previous config
- THROTTLED task is deleted from the store without firing a task change event 
(but firing a task deleted event)
- event R is finally processed by async bus and updater is notified. Updater 
follows the evaluate path and "thinks" a KILL is still required, which is legit 
but does not make sense as the task is no longer in the store.

So, the crux of the problem is entering updater evaluator loop with a stale 
task info. Figuring out if the task info is stale at the subscriber point is 
impossible as it requires updater inner state. The best, as it seems to me, let 
the updater figure out what to do but prevent the state changing action if it 
conflicts with the store. Sort of optimistic evaluation run.

Now to the question: The only downside of this race is a stack trace in the 
log. The actual errors are handled by the `JobUpdateEventSubscriber` and do not 
fail the update. These errors are polluting our logs and monitoring stats.


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review109984
---


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> ---
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
> https://issues.apache.org/jira/browse/AURORA-1506
> https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> ---
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 41226: Handling task event race in updater.

2015-12-11 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review109984
---


Can you take a stab at illustrating in the description how this problem 
manifests?  The current shape of the patch feels like it papers over a problem 
that we might be left to solve again, so thinking about the problem at a high 
level might help us think about a broader fix.

- Bill Farner


On Dec. 10, 2015, 5:33 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> ---
> 
> (Updated Dec. 10, 2015, 5:33 p.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
> https://issues.apache.org/jira/browse/AURORA-1506
> https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> ---
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 41226: Handling task event race in updater.

2015-12-10 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/
---

(Updated Dec. 11, 2015, 1:33 a.m.)


Review request for Aurora and Bill Farner.


Changes
---

Realized AURORA-1507 has the same root cause. Bundling it into this diff.


Summary (updated)
-

Handling task event race in updater.


Bugs: AURORA-1506 and AURORA-1507
https://issues.apache.org/jira/browse/AURORA-1506
https://issues.apache.org/jira/browse/AURORA-1507


Repository: aurora


Description (updated)
---

Task event processing delays may generate duplicate evaluation results in the 
`InstanceActionHandler`. This leads to occasional stack traces in the log but 
is generally benign as `JobUpdateEventSubscriber` catches these errors. 

The easiest fix is to check for the task existence before attempting to 
kill/add instance.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
d8686f1f1b53e5ff2791663489e24c342503831e 
  src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
0583a63a175880355a7296ebd1c6e6fb5dc99f38 
  src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
PRE-CREATION 

Diff: https://reviews.apache.org/r/41226/diff/


Testing
---

./gradlew -Pq build


Thanks,

Maxim Khutornenko



Re: Review Request 41226: Handling task event race in updater.

2015-12-10 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review109906
---

Ship it!


Master (6933a71) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> ---
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
> https://issues.apache.org/jira/browse/AURORA-1506
> https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> ---
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>