Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-21 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149954
---


Ship it!




Master (d3c5ca7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> ---
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
> https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> ##Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
> storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
> for (IScheduledTask scheduledTask : scheduledTasks) {
>   updateTaskAndExternalState(
>   storeProvider.getUnsafeTaskStore(),
>   Tasks.id(scheduledTask),
>   Optional.of(scheduledTask),
>   Optional.of(PENDING),
>   Optional.absent());
> }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ##Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> ---
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

2016-09-21 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/
---

Review request for Aurora, Stephan Erb and Zameer Manji.


Bugs: AURORA-1779
https://issues.apache.org/jira/browse/AURORA-1779


Repository: aurora


Description
---

##Problem

The current `BatchWorker` error handling assumes graceful recovery if any of 
the individual batch items fail. This may not hold true if the storage is 
modified _before_ the error is thrown and the `LogStorage` transaction is not 
rolled back. Consider the following fragment from the `StateManagerImpl`:
```
storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);

for (IScheduledTask scheduledTask : scheduledTasks) {
  updateTaskAndExternalState(
  storeProvider.getUnsafeTaskStore(),
  Tasks.id(scheduledTask),
  Optional.of(scheduledTask),
  Optional.of(PENDING),
  Optional.absent());
}
```
If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND 
`BatchWorker` catches and logs that exception, the native store transaction 
will be committed and as a result the storage will be left in a logically 
inconsistent state.

##Solution

I can see at least 3 ways to solve this problem:
1. Fail hard and shutdown scheduler when any of the batch items throw;
2. Fail only the last batch (drop all its items) and let storage transaction 
roll back;
3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
failed item would be retried when the storage transaction rolls back;

The #3 is the most involved and would be very hard to get right (assuming batch 
items are idempotent, which may not be the case). The #2 may result in a very 
hard to troubleshoot scenario where _some_ items would be dropped on the floor 
and never completed. 

I suggest we go with #1 as the most straightforward and transparent approach. 
It's also the only one that ensures storage state consistency, which is the 
most vital invariant to preserve (as AURORA-1779 proves). The only downside of 
this approach is that scheduler will go down hard any time an unhandled error 
is thrown but arguably this is the easiest way to improve our codebase and 
certainly better than leaving the scheduler in a crippled state.


Diffs
-

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
e05d4b48749b0691902a505bea1b4490fdd08f29 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
2ec3967ddb1d470cf681de062a6400f647978185 
  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
7c8047a7235a937c29fe96517242923ff84a080c 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
d390c07522d22e43d79ce4370985f3643ef021ca 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
8556253fc11f6027316871eb9dc66d8627a77fe6 

Diff: https://reviews.apache.org/r/52141/diff/


Testing
---

unit and e2e tests


Thanks,

Maxim Khutornenko



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 149
> > 
> >
> > leftover?

will remove it.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 184-185
> > 
> >
> > Are there any guarantees the status_manager will always deal with 
> > `ThermosTaskRunner.EXIT_STATE_MAP` values? The way I see it any unmapped 
> > state will end up not calling the `_shutdown()` at all. Perhaps a safer way 
> > could be having an explicit pair of `_running()` to be called only for 
> > RUNNING state and `_shutdown()` act as a default callback for anything 
> > else? Curious what others think here.
> 
> Kai Huang wrote:
> A safer way is to map all mesos_pb2.TaskState to shutdown(). 
> 
> But I think self.TERMINAL_STATES should be sufficient, for the reason 
> that all status checkers only send terminal status(except for health checker, 
> it also sends TASK_RUNNING). In this way, TERMINAL_STATES will reflect its 
> meaning(If we decide to expand the terminal state set, we can add it in 
> Executor base).

However, I agree that we should pass default callback to StatusManager


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 184-185
> > 
> >
> > Are there any guarantees the status_manager will always deal with 
> > `ThermosTaskRunner.EXIT_STATE_MAP` values? The way I see it any unmapped 
> > state will end up not calling the `_shutdown()` at all. Perhaps a safer way 
> > could be having an explicit pair of `_running()` to be called only for 
> > RUNNING state and `_shutdown()` act as a default callback for anything 
> > else? Curious what others think here.

A safer way is to map all mesos_pb2.TaskState to shutdown(). 

But I think self.TERMINAL_STATES should be sufficient, for the reason that all 
status checkers only send terminal status(except for health checker, it also 
sends TASK_RUNNING). In this way, TERMINAL_STATES will reflect its meaning(If 
we decide to expand the terminal state set, we can add it in Executor base).


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 162-164
> > 
> >
> > This should be moved into the `_maybe_update_failure_count()` to have 
> > all decision making logic in a single place.

sure, it make sense to remove "initial_in_progress" all together here.


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 279-283
> > 
> >
> > This logic needs to be refactored to reduce duplication and reuse 
> > what's now in `is_health_check_enabled` as much as possible. Ideally, we 
> > should have the only place we extract `health_checker` and the like.

The complexity is that the health checker did some computation to calculate the 
port map while going through the if condition. It would be helpful to introduce 
a utitlity class to store all the side-effect values like port map.


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 234
> > 
> >
> > This should be `mesos_pb2.TASK_RUNNING`, right?

Sure, it seems the status was used in a mixed pattern in the code base. I will 
refactor all occurrences.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/t

Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149883
---


Ship it!




Master (8432894) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 21, 2016, 6:51 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 21, 2016, 6:51 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 6:25 p.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 135
> > 
> >
> > Should this be a call to the `initial_in_progress` function? If so, 
> > this is indicative of a lack of test coverage, a bug of this caliber should 
> > definitely be causing a test failure somewhere ;).

Maxim suggested moving the logic into one place: maybe_update_failure_count. So 
in this way, our test will cover it.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149859
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/
---

(Updated Sept. 21, 2016, 6:51 p.m.)


Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
Manji.


Bugs: AURORA-1776
https://issues.apache.org/jira/browse/AURORA-1776


Repository: aurora


Description
---

Dynamic reservation requires us to use acceptOffers so this is in preparation 
of that work. acceptOffers is the call to dynamically reserve resources.


Diffs (updated)
-

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
  src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
14481468926b82a9bad3ee78c85f176e8ba893e2 
  src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
41b9aab55aa439b13eba3545703ec97b44690444 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
925c025d21bf1d44e0c1d319f6653ecfa8899481 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
a739bce1226d9435fa7d0b18e411064a4e78e49e 

Diff: https://reviews.apache.org/r/52074/diff/


Testing
---

src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Dmitriy Shirchenko



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149875
---



Master (8432894) is red with this patch.
  ./build-support/jenkins/build.sh

:commons:generateThriftResources
:commons:processResources
:commons:classes
:commons:jar
:compileJava/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74:
 Note: Wrote forwarder 
org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
@Forward({
^
Note: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/org/apache/aurora/common/args/apt/cmdline.arg.info.txt.2
Note: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/META-INF/compiler/resource-mappings/org.apache.aurora.common.args.apt.CmdLineProcessor

:generateBuildProperties
:processResources
:classes
:jar
:startScripts
:distTar
:distZip
:assemble
:compileJmhJavaNote: 
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java
 uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processJmhResources UP-TO-DATE
:jmhClasses
:checkstyleJmh
:jsHint
:checkstyleMain[ant:checkstyle] 
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java:17:8:
 error: Unused import - java.util.List.
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':checkstyleMain'.
> Checkstyle rule violations were found. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/checkstyle/main.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

BUILD FAILED

Total time: 3 mins 11.525 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 21, 2016, 6:23 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 21, 2016, 6:23 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149859
---




src/main/python/apache/aurora/executor/aurora_executor.py (lines 184 - 185)


Generally preferable to use a comprehension rather than calling map...

for terminal_state in set([v.status for k, v in 
ThermosTaskRunner.EXIT_STATE_MAP.items()]):
  ...

Alternately you could avoid iterating twice by invoking register directly 
from within the comprehension:

[self._register(v.status, self._shutdown) for _, v in 
ThermosTaskRunner.EXIT_STATE_MAP.items()]

You'd have to move the deduping to `_register` though, since we wouldn't be 
creating the intermediate set to strip the duplicate statuses.

and, on further investigation, *none* of the above should be necessary. We 
already have a set of terminal states defined on `ExecutorBase` which this 
class extends: 
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/executor_base.py#L30-L35.
 So you should just be able to do:

for terminal_state in self.TERMINAL_STATES:
  self._register(terminal_state, self._shutdown)



src/main/python/apache/aurora/executor/common/health_checker.py (line 128)


Should this be a call to the `initial_in_progress` function? If so, this is 
indicative of a lack of test coverage, a bug of this caliber should definitely 
be causing a test failure somewhere ;).


- Joshua Cohen


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/exe

Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/
---

(Updated Sept. 21, 2016, 6:23 p.m.)


Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
Manji.


Bugs: AURORA-1776
https://issues.apache.org/jira/browse/AURORA-1776


Repository: aurora


Description
---

Dynamic reservation requires us to use acceptOffers so this is in preparation 
of that work. acceptOffers is the call to dynamically reserve resources.


Diffs (updated)
-

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
  src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
14481468926b82a9bad3ee78c85f176e8ba893e2 
  src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
41b9aab55aa439b13eba3545703ec97b44690444 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
925c025d21bf1d44e0c1d319f6653ecfa8899481 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
a739bce1226d9435fa7d0b18e411064a4e78e49e 

Diff: https://reviews.apache.org/r/52074/diff/


Testing
---

src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Dmitriy Shirchenko



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/
---

(Updated Sept. 21, 2016, 6:19 p.m.)


Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
Manji.


Bugs: AURORA-1776
https://issues.apache.org/jira/browse/AURORA-1776


Repository: aurora


Description
---

Dynamic reservation requires us to use acceptOffers so this is in preparation 
of that work. acceptOffers is the call to dynamically reserve resources.


Diffs (updated)
-

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
  src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
14481468926b82a9bad3ee78c85f176e8ba893e2 
  src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
41b9aab55aa439b13eba3545703ec97b44690444 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
925c025d21bf1d44e0c1d319f6653ecfa8899481 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
a739bce1226d9435fa7d0b18e411064a4e78e49e 

Diff: https://reviews.apache.org/r/52074/diff/


Testing
---

src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Dmitriy Shirchenko



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149869
---




src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
(lines 50 - 54)


Agreed. I have been told that Aurora's philosophy is convention first, so 
I'm following what has been set.


- Dmitriy Shirchenko


On Sept. 20, 2016, 7:14 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 20, 2016, 7:14 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko


> On Sept. 20, 2016, 7:41 p.m., Mehrdad Nurolahzade wrote:
> > src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java, 
> > lines 52-56
> > 
> >
> > We should start disallowing or limiting import static. It causes poor 
> > readability. 
> > 
> > I was looking for a makeTask() method on this file.

Agreed. I have been told that Aurora's philosophy is convention first, so I'm 
following what has been set.


- Dmitriy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149714
---


On Sept. 20, 2016, 7:14 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 20, 2016, 7:14 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Dmitriy Shirchenko


> On Sept. 21, 2016, 5:54 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java, line 372
> > 
> >
> > Can be inlined below.

Done.


> On Sept. 21, 2016, 5:54 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java,
> >  line 124
> > 
> >
> > nit: add newline above

Done.


- Dmitriy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149862
---


On Sept. 20, 2016, 7:14 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 20, 2016, 7:14 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/status_manager.py, line 68
> > 
> >
> > This is exactly the problem I alluded to above. What if there is no 
> > entry for a given `status` here?

Yeah, I agree it's safer to make shutdown a default behavior here.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > api/src/main/thrift/org/apache/aurora/gen/api.thrift, lines 1218-1220
> > 
> >
> > What's the motivation behind moving these constants into the thrift 
> > layer? FWICT, these are only used inside the `health_checker.py` and should 
> > stay there.

If we want to move the is_health_check_enable logic into task_info.py. In 
task_info.py, We will need to use these constants defined in health_checker.py, 
which itself depends on task_info.py. So we are running into a circular 
dependency issue here. To avoid duplicating the constant, I made these 
constants global.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 52074: switching from launchTask to acceptOffers

2016-09-21 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52074/#review149862
---


Ship it!





src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
(line 124)


nit: add newline above



src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java (line 372)


Can be inlined below.


- Maxim Khutornenko


On Sept. 20, 2016, 7:14 p.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52074/
> ---
> 
> (Updated Sept. 20, 2016, 7:14 p.m.)
> 
> 
> Review request for Aurora, Maxim Khutornenko, Mehrdad Nurolahzade, and Zameer 
> Manji.
> 
> 
> Bugs: AURORA-1776
> https://issues.apache.org/jira/browse/AURORA-1776
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Dynamic reservation requires us to use acceptOffers so this is in preparation 
> of that work. acceptOffers is the call to dynamically reserve resources.
> 
> 
> Diffs
> -
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeDriver.java 
> 1bfd8dbf3aec314aaeb89b756ea3a2049df8777a 
>   src/main/java/org/apache/aurora/scheduler/mesos/Driver.java 
> 14481468926b82a9bad3ee78c85f176e8ba893e2 
>   src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverService.java 
> 41b9aab55aa439b13eba3545703ec97b44690444 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 925c025d21bf1d44e0c1d319f6653ecfa8899481 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> a739bce1226d9435fa7d0b18e411064a4e78e49e 
> 
> Diff: https://reviews.apache.org/r/52074/diff/
> 
> 
> Testing
> ---
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Dmitriy Shirchenko
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149864
---




src/main/python/apache/aurora/executor/common/health_checker.py (lines 153 - 
155)


A typo here, it should be:

if NOT self.health_check_passed:
   log.warning


- Kai Huang


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Maxim Khutornenko


> On Sept. 20, 2016, 12:28 a.m., Kai Huang wrote:
> > src/main/python/apache/aurora/executor/common/status_checker.py, line 104
> > 
> >
> > Use TaskState.Value('TASK_RUNNING') here instead of 
> > mesos_pb2.TASK_RUNNING, because this file also used TaskState in multiple 
> > places.

Not sure I get this. What prevents you from using `mesos_pb2.TASK_RUNNING` here 
and keep `TaskState` in other places?


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149581
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---




api/src/main/thrift/org/apache/aurora/gen/api.thrift (lines 1218 - 1220)


What's the motivation behind moving these constants into the thrift layer? 
FWICT, these are only used inside the `health_checker.py` and should stay there.



src/main/python/apache/aurora/executor/aurora_executor.py (line 80)


This does not have to be a global field, does it? It looks like 
`_register()` only adds callbacks but nothing ever reads from it.



src/main/python/apache/aurora/executor/aurora_executor.py (lines 184 - 185)


Are there any guarantees the status_manager will always deal with 
`ThermosTaskRunner.EXIT_STATE_MAP` values? The way I see it any unmapped state 
will end up not calling the `_shutdown()` at all. Perhaps a safer way could be 
having an explicit pair of `_running()` to be called only for RUNNING state and 
`_shutdown()` act as a default callback for anything else? Curious what others 
think here.



src/main/python/apache/aurora/executor/common/health_checker.py (line 140)


leftover?



src/main/python/apache/aurora/executor/common/health_checker.py (lines 153 - 
155)


This should be moved into the `_maybe_update_failure_count()` to have all 
decision making logic in a single place.



src/main/python/apache/aurora/executor/common/health_checker.py (line 225)


This should be `mesos_pb2.TASK_RUNNING`, right?



src/main/python/apache/aurora/executor/common/health_checker.py (lines 270 - 
274)


This logic needs to be refactored to reduce duplication and reuse what's 
now in `is_health_check_enabled` as much as possible. Ideally, we should have 
the only place we extract `health_checker` and the like.



src/main/python/apache/aurora/executor/status_manager.py (line 68)


This is exactly the problem I alluded to above. What if there is no entry 
for a given `status` here?


- Maxim Khutornenko


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when a successful required number of health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/ex