Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review151050
---


Ship it!




Master (59b4d31) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 30, 2016, 5:17 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 30, 2016, 5:17 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> The executor needs to start executing user content in STARTING and transition 
> to RUNNING once the required number of successful health checks is reached.
>
> This review contains a series of executor changes that implement 
> health-check-driven updates. It gives more context on the design of this 
> feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker 
> sends a "TASK_RUNNING" message once the required number of successful health 
> checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED" 
> message is sent if the health checker fails to reach the required number of 
> successful health checks within that period, or if the limit on failed 
> health checks is reached after initial_interval_secs.
>
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has 
> started. In this scenario, the vCurrent executor behaves the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review151045
---


Ship it!




Ship It!

- Joshua Cohen





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 30, 2016, 5:17 p.m.)


Review request for Aurora, Joshua Cohen and Zameer Manji.


Changes
---

Added a test case for the scenario where a status provider throws during 
initialization.


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
The executor needs to start executing user content in STARTING and transition 
to RUNNING once the required number of successful health checks is reached.

This review contains a series of executor changes that implement 
health-check-driven updates. It gives more context on the design of this 
feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health checking is enabled on the vCurrent executor, the health checker 
sends a "TASK_RUNNING" message once the required number of successful health 
checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED" 
message is sent if the health checker fails to reach the required number of 
successful health checks within that period, or if the limit on failed health 
checks is reached after initial_interval_secs.

If health checking is disabled on the vCurrent executor, it sends a 
"TASK_RUNNING" message to the scheduler after the thermos runner has started. 
In this scenario, the vCurrent executor behaves the same as the vPrev 
executor.
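
The transition rule described above could be sketched roughly as follows. This 
is an illustration only, not the code in health_checker.py: the class 
HealthTracker, its record() method, and the parameters min_successes and 
max_failures are hypothetical names. Only initial_interval_secs and the 
TASK_RUNNING / TASK_FAILED messages come from the description. Whether the 
failure limit is inclusive, and whether failures are counted consecutively, 
are also assumptions here.

```python
class HealthTracker(object):
    """Hypothetical tracker that decides which status update, if any, to send."""

    def __init__(self, initial_interval_secs, min_successes, max_failures):
        self.initial_interval_secs = initial_interval_secs
        self.min_successes = min_successes  # assumed name: successes needed for RUNNING
        self.max_failures = max_failures    # assumed name: failures tolerated once RUNNING
        self.successes = 0
        self.failures = 0
        self.running = False

    def record(self, healthy, elapsed_secs):
        """Record one health check result; return a status string or None."""
        in_initial_interval = elapsed_secs <= self.initial_interval_secs
        if healthy:
            self.successes += 1
        elif not in_initial_interval:
            # Assumption: failures only count after the initial interval.
            self.failures += 1

        if not self.running:
            if in_initial_interval and self.successes >= self.min_successes:
                self.running = True
                return "TASK_RUNNING"
            if not in_initial_interval:
                # Required successes were not reached in time.
                return "TASK_FAILED"
            return None

        if self.failures > self.max_failures:
            return "TASK_FAILED"
        return None
```

With health checks disabled, none of this applies: the executor would report 
TASK_RUNNING as soon as the thermos runner has started.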

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of 
initial_interval_secs.
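
To illustrate what "edge triggered" means in item 2, here is a minimal sketch. 
It is not the StatusManager implementation: EdgeTriggeredWatcher, poll(), and 
the provider/callback arguments are hypothetical names. The idea shown is that 
the callback fires only when the observed status changes, rather than on every 
poll that sees a non-None status.

```python
class EdgeTriggeredWatcher(object):
    """Hypothetical watcher that reports each status change exactly once."""

    def __init__(self, status_provider, callback):
        self._provider = status_provider  # callable returning a status or None
        self._callback = callback         # called with the new status on a change
        self._last = None

    def poll(self):
        current = self._provider()
        # Fire only on a transition to a new, non-None status.
        if current is not None and current != self._last:
            self._last = current
            self._callback(current)
```

Polling a provider that returns STARTING twice and then RUNNING twice would 
invoke the callback only twice, once per transition.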


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Joshua Cohen


> On Sept. 30, 2016, 4:11 p.m., Joshua Cohen wrote:
> > Can you add a test case for the scenario where a status provider throws 
> > during initialization?
> 
> Kai Huang wrote:
> Oh, that's a good point. Will do.

Thanks. e2e's passed for me locally w/ this patch, so once the tests are added 
this is good to go!


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review151029
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Kai Huang


> On Sept. 30, 2016, 4:11 p.m., Joshua Cohen wrote:
> > Can you add a test case for the scenario where a status provider throws 
> > during initialization?

Oh, that's a good point. Will do.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review151029
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-30 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review151029
---



Can you add a test case for the scenario where a status provider throws during 
initialization?

- Joshua Cohen





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-29 Thread Kai Huang


> On Sept. 28, 2016, 9:52 p.m., Joshua Cohen wrote:
> > I tried to commit this, but e2e tests hung for me.
> > 
> > Kai, can you investigate?
> 
> Kai Huang wrote:
> There is a bug in thermos kill for tasks running in a docker container 
> (https://issues.apache.org/jira/browse/AURORA-1426). My code change likely 
> either triggered this bug or exacerbated its effect. I'll investigate this.

It turned out to be a bug related to my changes in the Aurora executor. I fixed 
it by adding the missing error handling. See the new review request.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150769
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-29 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 29, 2016, 11:11 p.m.)


Review request for Aurora, Joshua Cohen and Zameer Manji.


Changes
---

Fixed a bug in the Aurora executor: added error handling when setting up 
StatusCheckers.
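
As a rough sketch of the kind of error handling meant here (not the actual 
change in aurora_executor.py), the function below catches an exception raised 
while a status checker is being created and reports a failure instead of 
leaving the task in limbo. The names setup_status_checkers, providers, and 
on_failure are hypothetical.

```python
def setup_status_checkers(providers, on_failure):
    """Build one checker per provider; report a failure if any provider throws.

    `providers` is a list of zero-argument callables (hypothetical stand-ins
    for status providers). `on_failure` receives a human-readable message.
    Returns the list of checkers, or None if setup failed.
    """
    checkers = []
    for provider in providers:
        try:
            checkers.append(provider())
        except Exception as e:
            # Without this handler the exception would escape the setup path.
            on_failure("Failed to set up status checker: %s" % e)
            return None
    return checkers
```

A provider that raises during initialization then results in a single call to 
on_failure, which is also the scenario a unit test for this path would cover.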


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
The executor needs to start executing user content in STARTING and transition 
to RUNNING once the required number of successful health checks is reached.

This review contains a series of executor changes that implement 
health-check-driven updates. It gives more context on the design of this 
feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health checking is enabled on the vCurrent executor, the health checker 
sends a "TASK_RUNNING" message once the required number of successful health 
checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED" 
message is sent if the health checker fails to reach the required number of 
successful health checks within that period, or if the limit on failed health 
checks is reached after initial_interval_secs.

If health checking is disabled on the vCurrent executor, it sends a 
"TASK_RUNNING" message to the scheduler after the thermos runner has started. 
In this scenario, the vCurrent executor behaves the same as the vPrev 
executor.

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of 
initial_interval_secs.
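
Item 1 can be pictured with the sketch below. It is not the 
ChainedStatusChecker code: ChainedChecker and the assumption that each 
sub-checker exposes a `status` attribute are illustrative. The point is that 
the chain recomputes its result on every access instead of caching the first 
non-None status, presumably so that a later status (for example a failure 
after RUNNING) remains visible.

```python
class ChainedChecker(object):
    """Hypothetical chain of checkers with no memoized status."""

    def __init__(self, *checkers):
        self._checkers = checkers

    @property
    def status(self):
        # Re-query every sub-checker on each access; nothing is cached.
        for checker in self._checkers:
            result = checker.status
            if result is not None:
                return result
        return None
```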


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-29 Thread Kai Huang


> On Sept. 28, 2016, 9:52 p.m., Joshua Cohen wrote:
> > I tried to commit this, but e2e tests hung for me.
> > 
> > Kai, can you investigate?

There is a bug in thermos kill for tasks running in a docker container 
(https://issues.apache.org/jira/browse/AURORA-1426). My code change likely 
either triggered this bug or exacerbated its effect. I'll investigate this.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150769
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-28 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150769
---



I tried to commit this, but e2e tests hung for me.

Kai, can you investigate?

- Joshua Cohen


On Sept. 28, 2016, 9:07 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 28, 2016, 9:07 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-28 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150764
---


Ship it!




Master (1c1c0a5) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 28, 2016, 9:07 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 28, 2016, 9:07 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-28 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 28, 2016, 9:07 p.m.)


Review request for Aurora, Joshua Cohen and Zameer Manji.


Changes
---

Explicitly check all callback types in StatusManager


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
Executor needs to start executing user content in STARTING and transition to 
RUNNING when the required number of successful health checks is reached.

This review contains a series of executor changes that implement the health 
check driven updates. It gives more context on the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health check is enabled on the vCurrent executor, the health checker will 
send a "TASK_RUNNING" message when the required number of successful health 
checks is reached within initial_interval_secs. On the other hand, a 
"TASK_FAILED" message is sent if the health checker fails to reach the 
required number of successful health checks within that period, or if the 
maximum number of failed health checks is reached after 
initial_interval_secs.

If health check is disabled on the vCurrent executor, it will send a 
"TASK_RUNNING" message to the scheduler after the thermos runner is started. 
In this scenario, the behavior of the vCurrent executor will be the same as 
that of the vPrev executor.
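
The transition rules in this description can be sketched roughly as follows. This is a minimal illustration, not the code under review: `HealthCheckTracker`, `record`, `min_successes`, and `max_failures` are hypothetical names, and the sketch assumes that successes and failures are counted consecutively, that failures only count once initial_interval_secs has elapsed, and that the task fails when the failure count exceeds the limit.

```python
from enum import Enum


class TaskState(Enum):
    STARTING = "TASK_STARTING"
    RUNNING = "TASK_RUNNING"
    FAILED = "TASK_FAILED"


class HealthCheckTracker(object):
    """Hypothetical stand-in for illustration; not the real HealthChecker."""

    def __init__(self, initial_interval_secs, min_successes, max_failures):
        self.initial_interval_secs = initial_interval_secs
        self.min_successes = min_successes  # assumed: consecutive successes required
        self.max_failures = max_failures    # assumed: consecutive failures tolerated
        self.successes = 0
        self.failures = 0
        self.state = TaskState.STARTING

    def record(self, healthy, elapsed_secs):
        """Feed one health check result and return the resulting task state."""
        if self.state is TaskState.FAILED:
            return self.state  # FAILED is terminal in this sketch
        past_initial = elapsed_secs > self.initial_interval_secs
        if healthy:
            self.successes += 1
            self.failures = 0
        else:
            self.successes = 0
            if past_initial:
                # Assumption: failures only count after the initial interval.
                self.failures += 1
        if self.state is TaskState.STARTING:
            if past_initial:
                # Required successes were not reached within the initial interval.
                self.state = TaskState.FAILED
            elif self.successes >= self.min_successes:
                self.state = TaskState.RUNNING
        elif past_initial and self.failures > self.max_failures:
            self.state = TaskState.FAILED
        return self.state
```

With `min_successes=2` and `initial_interval_secs=15`, two healthy results at 5 and 10 seconds move the sketch from STARTING to RUNNING, while a task that is still STARTING at 16 seconds is marked FAILED.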

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of initial_interval_secs.


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-27 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150644
---


Ship it!




Master (69cba78) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 27, 2016, 10:25 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 27, 2016, 10:25 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-27 Thread Kai Huang


> On Sept. 27, 2016, 9:27 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/common/status_checker.py, line 104
> > 
> >
> > This is slightly beyond the pull request, but for the matter of 
> > generalization: We could only break here if the returned state is in 
> > `ExecutorBase.TERMINAL_STATES`.

It is possible the status is in an unknown or unmapped state; in this case it's 
safer to invoke the shutdown callback function.


> On Sept. 27, 2016, 9:27 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 219
> > 
> >
> > The `status_result` already carries that "is healthy" string as its 
> > `reason`. To reduce duplication you should use that instead.

Thanks, will fix it.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150612
---


On Sept. 23, 2016, 6:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 6:58 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-27 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150612
---


Ship it!




LGTM. 

A couple of minor remarks below. I am late to the party and have not followed 
the previous discussion, so feel free to ignore everything that has been 
brought up before.


src/main/python/apache/aurora/executor/aurora_executor.py (line 180)


Mutable args make code hard to reason about. I'd prefer something like the 
following instead:

status_checkers = [self._kill_manager] + status_checkers

Alternatively, you could keep the default initialization of 
`status_checkers` in the calling function so that the kill manager is already 
present when `_start_status_manager` is called.
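
To illustrate the difference, here is a small sketch contrasting the two styles. The function names and the string placeholders are hypothetical; only the `[kill_manager] + status_checkers` pattern comes from the suggestion above.

```python
def prepend_kill_manager_mutating(kill_manager, status_checkers):
    # Mutates the list the caller passed in; the caller sees the change.
    status_checkers.insert(0, kill_manager)
    return status_checkers


def prepend_kill_manager(kill_manager, status_checkers):
    # Builds a new list; the caller's list is left untouched.
    return [kill_manager] + list(status_checkers)
```

With the second form, a caller that passes `["health_checker"]` still holds a one-element list afterwards, which makes the call easier to reason about.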



src/main/python/apache/aurora/executor/aurora_executor.py (line 213)


The `status_result` already carries that "is healthy" string as its 
`reason`. To reduce duplication you should use that instead.



src/main/python/apache/aurora/executor/common/status_checker.py (line 88)


This field now seems unnecessary, as memoization has been dropped. The 
`status()` method could be refactored accordingly.



src/main/python/apache/aurora/executor/common/status_checker.py (line 104)


This is slightly beyond the pull request, but for the matter of 
generalization: We could only break here if the returned state is in 
`ExecutorBase.TERMINAL_STATES`.
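
One possible reading of this suggestion, as a hedged sketch: `TERMINAL_STATES` below is an assumed stand-in for `ExecutorBase.TERMINAL_STATES` (the real set may differ), and both functions are hypothetical simplifications that take the statuses reported by each chained checker rather than the checkers themselves.

```python
# Assumed stand-in for ExecutorBase.TERMINAL_STATES; the real set may differ.
TERMINAL_STATES = frozenset(
    ["TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"])


def first_status(statuses):
    """Stop at the first checker that reports any status."""
    for status in statuses:
        if status is not None:
            return status
    return None


def first_terminal_status(statuses):
    """Stop early only on a terminal status; otherwise keep the first status seen."""
    fallback = None
    for status in statuses:
        if status is None:
            continue
        if status in TERMINAL_STATES:
            return status
        if fallback is None:
            fallback = status
    return fallback
```

In the second variant a non-terminal status such as "TASK_RUNNING" from an early checker does not hide a terminal status from a later one. The trade-off is that an unknown or unmapped status no longer stops the chain early.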


- Stephan Erb


On Sept. 23, 2016, 8:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 8:58 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-27 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150628
---


Ship it!




lgtm overall, thanks for iterating!

Modulo the below comment, I agree with your comment above that we should export 
a metric for consecutive_successes. Once those changes are made I'll commit 
this.


src/main/python/apache/aurora/executor/status_manager.py (lines 48 - 50)


It would probably be more informative to explicitly check callable on 
each callback so we can report which one had a problem.
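
A minimal sketch of what such a per-callback check could look like. The helper name `validate_callbacks` and the keyword names used in the usage note are hypothetical, not the actual StatusManager signature.

```python
def validate_callbacks(**callbacks):
    """Raise a TypeError that names the first callback that is not callable."""
    for name, callback in callbacks.items():
        if not callable(callback):
            raise TypeError(
                "%s needs to be callable, got %r" % (name, callback))
```

Calling `validate_callbacks(running_callback=on_running, shutdown_callback=None)` would then raise an error that mentions `shutdown_callback` by name instead of a generic message.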


- Joshua Cohen


On Sept. 23, 2016, 6:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 6:58 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-26 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150428
---




src/main/python/apache/aurora/executor/common/health_checker.py (line 196)


We should export consecutive_successes as well?


- Kai Huang


On Sept. 23, 2016, 6:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 6:58 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on the vCurrent executor, the health checker will 
> send a "TASK_RUNNING" message when the required number of successful health 
> checks is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after 
> initial_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will send a 
> "TASK_RUNNING" message to the scheduler after the thermos runner is started. 
> In this scenario, the behavior of the vCurrent executor will be the same as 
> that of the vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-26 Thread Zameer Manji


> On Sept. 23, 2016, 3:18 p.m., Zameer Manji wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 120
> > 
> >
> > This check is brittle to determine if health checking is enabled. 
> > Please consider an alternative approach.
> > 
> > You have access to the `assigned_task` object. Passing that in to the 
> > helper method `mesos_task_instance_from_assigned_task` will give you an 
> > instance of the executor config.
> > 
> > You can then check the `health_check_config()` property of the result 
> > to see if health checking is enabled for the task.
> 
> Kai Huang wrote:
> Seems like we are taking a step back? In the first revision, I 
> implemented it pretty much the way you mentioned above, that is, by creating 
> an is_health_check_enabled(assigned_task) function in task_info.py. 
> 
> However, Maxim raised a valid point that is_health_check_enabled has some 
> duplication with the creation of the health checker in a later step.
> So the question is whether we should reuse the logic of 
> is_health_check_enabled in health_checker.
> 
> One solution is to store all the computation results (prot_map, 
> health_checker, health_check_config) in a utility class so that they can be 
> reused later. But a downside is that is_health_check_enabled would then 
> serve multiple purposes, and the meaning of this function would not be clear. 
> It should only answer one question: is health check enabled on this task? 
> 
> From my perspective, I think we should allow some sacrifice of 
> reusability here.

You make a good argument here, so I will drop my objection.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150246
---


On Sept. 23, 2016, 11:58 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 11:58 a.m.)
> 
> 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-26 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150427
---


Ship it!




Ship It!

- Zameer Manji


On Sept. 23, 2016, 11:58 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 11:58 a.m.)
> 
> 



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Kai Huang


> On Sept. 23, 2016, 10:18 p.m., Zameer Manji wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 120
> > 
> >
> > This check is brittle to determine if health checking is enabled. 
> > Please consider an alternative approach.
> > 
> > You have access to the `assigned_task` object. Passing that in to the 
> > helper method `mesos_task_instance_from_assigned_task` will give you an 
> > instance of the executor config.
> > 
> > You can then check the `health_check_config()` property of the result 
> > to see if health checking is enabled for the task.

Seems like we are taking a step back? In the first revision, I implemented it 
pretty much the way you describe above, by creating an 
is_health_check_enabled(assigned_task) function in task_info.py. 

However, Maxim raised a valid point that is_health_check_enabled duplicates 
some of the logic used to create the health checker in a later step.
So the question is whether we should reuse the logic of 
is_health_check_enabled in health_checker.

One solution is to store all the computed results (prot_map, health_checker, 
health_check_config) in a utility class so that they can be reused later. But a 
downside is that is_health_check_enabled would then serve multiple purposes, 
and the meaning of the function would no longer be clear. It should only answer 
one question: is health check enabled on this task? 

From my perspective, I think we should allow some sacrifice of reusability 
here.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150246
---


On Sept. 23, 2016, 6:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 6:58 p.m.)
> 
> 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150246
---



I like the additional testing done.


src/main/python/apache/aurora/executor/aurora_executor.py (line 120)


This check is brittle to determine if health checking is enabled. Please 
consider an alternative approach.

You have access to the `assigned_task` object. Passing that in to the 
helper method `mesos_task_instance_from_assigned_task` will give you an 
instance of the executor config.

You can then check the `health_check_config()` property of the result to 
see if health checking is enabled for the task.
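
As a rough illustration of the suggestion above, here is a minimal sketch. The
`AssignedTask` and `ExecutorConfig` classes and the `config_from_assigned_task`
function are hypothetical stand-ins (the last one plays the role of
`mesos_task_instance_from_assigned_task`); the real types differ, and treating
"no health check config" as "disabled" is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for the real assigned_task object.
AssignedTask = namedtuple("AssignedTask", ["executor_config"])


class ExecutorConfig(object):
    """Hypothetical stand-in for the executor config instance."""

    def __init__(self, health_check_config=None):
        self._health_check_config = health_check_config

    def health_check_config(self):
        return self._health_check_config


def config_from_assigned_task(assigned_task):
    # Stand-in for the helper mesos_task_instance_from_assigned_task.
    return assigned_task.executor_config


def is_health_check_enabled(assigned_task, load_config=config_from_assigned_task):
    """Answer one question only: is health checking configured for this task?"""
    config = load_config(assigned_task)
    # Assumption: a missing health check config means health checking is off.
    return config.health_check_config() is not None
```

The helper is passed in as a parameter so the check can be exercised without
building a real executor config.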


- Zameer Manji


On Sept. 23, 2016, 11:58 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 11:58 a.m.)
> 
> 



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150226
---


Ship it!




Master (4ead189) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 23, 2016, 6:58 p.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 6:58 p.m.)
> 
> 



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 23, 2016, 6:58 p.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Modified the logic of checking if health check is enabled for a task.


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
Executor needs to start executing user content in STARTING and transition to 
RUNNING when the required number of successful health checks is reached.

This review contains a series of executor changes that implement the health 
check driven updates. It gives more context of the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health check is enabled on the vCurrent executor, the health checker will 
send a "TASK_RUNNING" message when the required number of successful health 
checks is reached within initial_interval_secs. On the other hand, a 
"TASK_FAILED" message is sent if the health checker fails to reach the required 
number of successful health checks within that period, or if the maximum number 
of failed health checks is reached after initial_interval_secs.

If health check is disabled on the vCurrent executor, it will send a 
"TASK_RUNNING" message to the scheduler after the thermos runner has started. 
In this scenario, the behavior of the vCurrent executor will be the same as 
that of the vPrev executor.
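
The transition rules described above can be sketched roughly as follows. This
is not the code under review: the class name, the parameter names other than
initial_interval_secs, the use of consecutive counts, and the ">=" threshold
are all assumptions made for illustration.

```python
class HealthTransitionSketch(object):
    """Toy model of the STARTING -> RUNNING / FAILED decision."""

    def __init__(self, initial_interval_secs, min_successes, max_failures):
        self.initial_interval_secs = initial_interval_secs
        self.min_successes = min_successes  # assumed name
        self.max_failures = max_failures    # assumed name
        self.consecutive_successes = 0
        self.consecutive_failures = 0
        self.running = False

    def record(self, healthy, elapsed_secs):
        """Record one check result; return the status to send, or None."""
        if healthy:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0

        if not self.running:
            # Still in STARTING: successes must arrive within the window.
            if elapsed_secs > self.initial_interval_secs:
                return "TASK_FAILED"
            if self.consecutive_successes >= self.min_successes:
                self.running = True
                return "TASK_RUNNING"
            return None

        # Already RUNNING: fail once the failure limit is reached.
        if self.consecutive_failures >= self.max_failures:
            return "TASK_FAILED"
        return None
```

With health checking disabled, none of this applies and "TASK_RUNNING" is sent
once the runner has started.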

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of 
initial_interval_secs.
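
To illustrate what "edge triggered" means in item 2, here is a minimal sketch,
not the actual StatusManager: the class and its `poll`/`on_change` parameters
are hypothetical. The callback fires only when the polled status changes, not
on every poll that returns a status.

```python
class EdgeTriggeredWatcher(object):
    """Toy watcher that reports a status only when it changes."""

    def __init__(self, poll, on_change):
        self._poll = poll            # callable returning a status or None
        self._on_change = on_change  # callable invoked on each transition
        self._last = None

    def tick(self):
        """Poll once; invoke the callback only on a new, non-None status."""
        current = self._poll()
        if current is not None and current != self._last:
            self._last = current
            self._on_change(current)
            return True
        return False
```

A level-triggered loop would instead invoke the callback on every tick where a
status is present, which would repeat the same update.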


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Kai Huang


> On Sept. 23, 2016, 4:53 p.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, line 81
> > 
> >
> > I don't think we should expose this simply for the sake of testing. 
> > Besides breaking abstractions, it's also brittle. A bug could set this 
> > value to `True` even if the behavior we want to test is broken.
> > 
> > We should be able to test the health check logic directly based on the 
> > behavior of the executor. I.e. assert that the `TASK_RUNNING` update is not 
> > sent right away if one of the status providers is a health checker.

Actually, I've tried this approach in the tests. Checking for the presence of a 
health checker seems to break the test, possibly because it duplicates the 
health checker preparation in the launchTask function. The problem can be 
solved by placing the check before we launch a task (that way, checking for the 
presence of a health checker does not interfere with sending the status 
update). I'm kind of curious why the test is so time sensitive.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150200
---


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-23 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150200
---




src/main/python/apache/aurora/executor/aurora_executor.py (line 81)


I don't think we should expose this simply for the sake of testing. Besides 
breaking abstractions, it's also brittle. A bug could set this value to `True` 
even if the behavior we want to test is broken.

We should be able to test the health check logic directly based on the 
behavior of the executor. I.e. assert that the `TASK_RUNNING` update is not 
sent right away if one of the status providers is a health checker.
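
A behavior-based test along these lines might look like the sketch below.
Everything in it is a hypothetical stand-in (`RecordingDriver`,
`FakeHealthChecker`, `SketchExecutor`), not the real executor or its test
harness; the point is only that the assertions look at the updates sent, not
at an internal flag.

```python
import unittest


class RecordingDriver(object):
    """Hypothetical driver stand-in that records each status update."""

    def __init__(self):
        self.updates = []

    def send_status_update(self, state):
        self.updates.append(state)


class FakeHealthChecker(object):
    """Hypothetical status provider standing in for a health checker."""


class SketchExecutor(object):
    """Toy executor: defers TASK_RUNNING when a health checker is present."""

    def __init__(self, driver, status_providers):
        self._driver = driver
        self._status_providers = status_providers

    def launch_task(self):
        self._driver.send_status_update("TASK_STARTING")
        has_health_checker = any(
            isinstance(p, FakeHealthChecker) for p in self._status_providers)
        if not has_health_checker:
            # No health checker: report RUNNING right after start.
            self._driver.send_status_update("TASK_RUNNING")


class LaunchBehaviorTest(unittest.TestCase):
    def test_running_deferred_with_health_checker(self):
        driver = RecordingDriver()
        SketchExecutor(driver, [FakeHealthChecker()]).launch_task()
        self.assertNotIn("TASK_RUNNING", driver.updates)

    def test_running_sent_without_health_checker(self):
        driver = RecordingDriver()
        SketchExecutor(driver, []).launch_task()
        self.assertIn("TASK_RUNNING", driver.updates)
```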



src/main/python/apache/aurora/executor/aurora_executor.py (lines 122 - 133)


I don't think we need to move all of this logic from 
`_start_status_manager`. We only need to move the creation of the status 
providers from `self._status_providers` here.

The rest (registering them with metrics, adding `self._kill_manager` to the 
list) can remain in `_start_status_manager`.



src/main/python/apache/aurora/executor/aurora_executor.py (line 183)


Not related to your change, but the `driver` argument is unused here. Can 
you drop it?



src/main/python/apache/aurora/executor/common/health_checker.py (line 36)


kill this blank line?



src/main/python/apache/aurora/executor/common/health_checker.py (line 76)


This is the opposite of `current_consecutive_failures`, right? I.e. it's 
not the total number of health checks that have been performed, but instead the 
number that have passed, yes?

If so, consider renaming to `current_consecutive_successes` for symmetry.



src/main/python/apache/aurora/executor/common/health_checker.py (lines 219 - 220)


Fits on one line.



src/main/python/apache/aurora/executor/status_manager.py (line 50)


Why pass these as a tuple instead of individual args?


- Joshua Cohen


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150117
---


Ship it!




Master (4ead189) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review150115
---




src/main/python/apache/aurora/executor/aurora_executor.py (line 121)


This is perhaps not the best approach. Exposing 
self.is_health_check_enabled for testing purposes seems a little hacky. 
However, I see a couple of benefits here:

1. It does things in one pass and reuses the is_health_check_enabled logic as 
much as possible, with minimal duplication.
2. It minimizes the changes to the health checker and task_info.
3. It reflects the steps described in the comment of this function:
- Initialize the sandbox
- Start the ThermosTaskRunner (fork the Thermos TaskRunner)
- Set up necessary HealthCheckers
- Set up StatusManager, and attach HealthCheckers


- Kai Huang


On Sept. 23, 2016, 12:57 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 23, 2016, 12:57 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 23, 2016, 12:57 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Refactored is_health_check_enabled to do one pass when setting up the necessary 
HealthCheckers.


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
Executor needs to start executing user content in STARTING and transition to 
RUNNING when the required number of successful health checks is reached.

This review contains a series of executor changes that implement the health 
check driven updates. It gives more context of the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health checking is enabled on the vCurrent executor, the health checker sends a 
"TASK_RUNNING" message when the required number of successful health checks is 
reached within initial_interval_secs. On the other hand, a "TASK_FAILED" 
message is sent if the health checker fails to reach the required number of 
successful health checks within that period, or if the maximum number of failed 
health checks is reached after initial_interval_secs.

If health checking is disabled on the vCurrent executor, it sends a 
"TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
this scenario, the behavior of the vCurrent executor is the same as that of the 
vPrev executor.
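
To make the transition rules above concrete, here is a minimal sketch of the 
decision for the health-check-enabled case. It is not the code in 
health_checker.py: the function name next_status and all of its parameters are 
hypothetical, and it assumes "limit is reached" means the failure count exceeds 
max_failures.

```python
TASK_RUNNING = "TASK_RUNNING"
TASK_FAILED = "TASK_FAILED"


def next_status(elapsed_secs, successes, failures, is_running,
                initial_interval_secs, required_successes, max_failures):
    """Return the status update to send, or None if nothing should be sent.

    All names here are illustrative assumptions, not the executor's API.
    """
    in_initial_interval = elapsed_secs <= initial_interval_secs
    if not is_running:
        if in_initial_interval:
            # Still starting: promote to RUNNING once enough checks succeeded.
            return TASK_RUNNING if successes >= required_successes else None
        # The initial interval expired before the task became healthy.
        return TASK_FAILED
    # Already RUNNING: the failure limit only applies after the interval.
    if not in_initial_interval and failures > max_failures:
        return TASK_FAILED
    return None
```

For example, next_status(5, 2, 0, False, 15, 2, 0) yields TASK_RUNNING, while 
the same task with only one success at 20 seconds yields TASK_FAILED.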

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of initial_interval_secs.
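
As a rough illustration of item 2, one reading of "edge triggered" is that a 
callback fires only when the observed status changes, not on every poll that 
returns the same status. The class below is a hypothetical sketch under that 
assumption; EdgeTriggeredNotifier and observe are invented names and do not 
mirror the real StatusManager interface.

```python
class EdgeTriggeredNotifier:
    """Invoke a callback only when the observed status changes (hypothetical)."""

    def __init__(self, callback):
        self._callback = callback
        self._last_status = None

    def observe(self, status):
        """Record one poll result; return True if the callback was fired."""
        if status is None or status == self._last_status:
            # No status yet, or no change since the last poll: stay quiet.
            return False
        self._last_status = status
        self._callback(status)
        return True
```

Polling None, TASK_RUNNING, TASK_RUNNING, TASK_FAILED through observe() would 
fire the callback twice: once for TASK_RUNNING and once for TASK_FAILED.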


Diffs (updated)
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-22 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 279-283
> > 
> >
> > This logic needs to be refactored to reduce duplication and reuse 
> > what's now in `is_health_check_enabled` as much as possible. Ideally, we 
> > should have the only place we extract `health_checker` and the like.
> 
> Kai Huang wrote:
> The complexity is that the health checker did some computation to 
> calculate the port map while going through the if condition. It would be 
> helpful to introduce a utility class to store all the side-effect values 
> like port map.

I had two concerns here:
1. If we extract is_health_check_enabled() from health_checker.py and put it 
into task_info.py, this results in a circular dependency, because 
is_health_check_enabled() requires some string constants and variables 
defined in health_checker.py, which in turn requires task_info.py. This is 
why I moved the string constants to api.thrift as global constants.

2. It seems that is_health_check_enabled() will always be overkill. We cannot 
just check whether health checking is enabled for a task without computing 
the port map and extracting the health_checker and health_check_config from 
it. The check is coupled with setting up a health checker. To eliminate 
duplication, Joshua and I discussed the idea of checking whether health 
checking is enabled while we set up the necessary health checkers. However, 
this approach results in flaky tests, probably because it is too heavyweight.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 149
> > 
> >
> > leftover?

will remove it.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 184-185
> > 
> >
> > Are there any guarantees the status_manager will always deal with 
> > `ThermosTaskRunner.EXIT_STATE_MAP` values? The way I see it any unmapped 
> > state will end up not calling the `_shutdown()` at all. Perhaps a safer way 
> > could be having an explicit pair of `_running()` to be called only for 
> > RUNNING state and `_shutdown()` act as a default callback for anything 
> > else? Curious what others think here.
> 
> Kai Huang wrote:
> A safer way is to map all mesos_pb2.TaskState values to shutdown(). 
> 
> But I think self.TERMINAL_STATES should be sufficient, because all 
> status checkers only send terminal statuses (except for the health checker, 
> which also sends TASK_RUNNING). This way, TERMINAL_STATES will reflect its 
> meaning (if we decide to expand the terminal state set, we can add it in 
> Executor base).

However, I agree that we should pass a default callback to StatusManager.
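
A minimal sketch of what a default callback could look like, assuming the 
dispatch is a plain dictionary lookup: only TASK_RUNNING maps to the running 
callback, and every other status, mapped or not, falls through to the default 
(shutdown) callback. The name make_dispatcher and the callback signatures are 
hypothetical and are not taken from status_manager.py.

```python
TASK_RUNNING = "TASK_RUNNING"


def make_dispatcher(running_callback, default_callback):
    """Build a dispatch function with an explicit fallback (hypothetical)."""
    callbacks = {TASK_RUNNING: running_callback}

    def dispatch(status):
        # Any status without an explicit entry goes to the default callback,
        # so an unmapped state can never be silently dropped.
        callback = callbacks.get(status, default_callback)
        callback(status)

    return dispatch
```

With this shape, a status such as TASK_LOST that nobody mapped explicitly 
still reaches the shutdown path.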


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py, lines 184-185
> > 
> >
> > Are there any guarantees the status_manager will always deal with 
> > `ThermosTaskRunner.EXIT_STATE_MAP` values? The way I see it any unmapped 
> > state will end up not calling the `_shutdown()` at all. Perhaps a safer way 
> > could be having an explicit pair of `_running()` to be called only for 
> > RUNNING state and `_shutdown()` act as a default callback for anything 
> > else? Curious what others think here.

A safer way is to map all mesos_pb2.TaskState values to shutdown(). 

But I think self.TERMINAL_STATES should be sufficient, because all status 
checkers only send terminal statuses (except for the health checker, which 
also sends TASK_RUNNING). This way, TERMINAL_STATES will reflect its meaning 
(if we decide to expand the terminal state set, we can add it in Executor base).


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker sends 
> a "TASK_RUNNING" message when the required number of successful health checks 
> is reached within initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message is sent if the health checker fails to reach the 
> required number of successful health checks within that period, or if the 
> maximum number of failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. In 
> this scenario, the behavior of the vCurrent executor is the same as that of the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 162-164
> > 
> >
> > This should be moved into the `_maybe_update_failure_count()` to have 
> > all decision making logic in a single place.

Sure, it makes sense to remove "initial_in_progress" altogether here.


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, lines 
> > 279-283
> > 
> >
> > This logic needs to be refactored to reduce duplication and reuse 
> > what's now in `is_health_check_enabled` as much as possible. Ideally, we 
> > should have the only place we extract `health_checker` and the like.

The complexity is that the health checker does some computation to calculate the 
port map while going through the if condition. It would be helpful to introduce 
a utility class to store all the side-effect values, such as the port map.
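
As a sketch of that idea, a small immutable record could carry the enabled 
flag together with the values computed along the way, so callers do not 
recompute them. Everything here is hypothetical: HealthCheckSetup, 
resolve_health_check, and the assumption that a port named "health" is what 
enables health checking are illustrative only and are not the logic in 
health_checker.py.

```python
from collections import namedtuple

# Bundles the decision with the side-effect values computed while making it.
HealthCheckSetup = namedtuple("HealthCheckSetup", ["enabled", "portmap", "config"])


def resolve_health_check(assigned_ports, health_check_config,
                         health_port_name="health"):
    """Compute the port map once and report whether health checks apply.

    Assumes (for illustration) that health checking is enabled exactly when
    a port named health_port_name has been assigned.
    """
    portmap = dict(assigned_ports)
    enabled = health_port_name in portmap
    config = health_check_config if enabled else None
    return HealthCheckSetup(enabled=enabled, portmap=portmap, config=config)
```

A caller could then test setup.enabled and, if it is true, build the health 
checker from setup.portmap and setup.config in the same pass.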


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 234
> > 
> >
> > This should be `mesos_pb2.TASK_RUNNING`, right?

Sure, it seems the status is referenced inconsistently across the code base. I 
will refactor all occurrences.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement the health 
> check driven updates. It gives more context of the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health check is enabled on vCurrent executor, the health checker will send 
> a "TASK_RUNNING" message when a successful required number of health checks 
> is reached within the initial_interval_secs. On the other hand, a 
> "TASK_FAILED" message was sent if the health checker fails to reach the 
> required number of health checks within that period, or a maximum number of 
> failed health check limit is reached after the initital_interval_secs.
> 
> If health check is disabled on the vCurrent executor, it will sends 
> "TASK_RUNNING" message to scheduler after the thermos runner was started. In 
> this scenario, the behavior of vCurrent executor will be the same as the 
> vPrev executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 6:25 p.m., Joshua Cohen wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py, line 135
> > 
> >
> > Should this be a call to the `initial_in_progress` function? If so, 
> > this indicates a lack of test coverage; a bug of this caliber should 
> > definitely be causing a test failure somewhere ;).

Maxim suggested moving this logic into one place, 
`_maybe_update_failure_count`, so that our tests will cover it.
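
For illustration, here is a minimal sketch of what keeping every status 
decision in a single method could look like. It is not the code under review: 
the class name, the parameter names `min_successes` and `max_failures`, and 
the injectable `clock` are assumptions made for this example, and the rules 
follow the review description (RUNNING once enough successful checks arrive 
within initial_interval_secs, FAILED otherwise or after too many failures 
later on).

```python
import time

TASK_RUNNING = 'TASK_RUNNING'
TASK_FAILED = 'TASK_FAILED'


class HealthCheckTracker(object):
  """Hypothetical bookkeeping class; not the actual Aurora health checker."""

  def __init__(self, initial_interval_secs, min_successes, max_failures,
               clock=time.time):
    self._initial_interval_secs = initial_interval_secs
    self._min_successes = min_successes  # assumed parameter name
    self._max_failures = max_failures    # assumed parameter name
    self._clock = clock
    self._start = clock()
    self._successes = 0
    self._consecutive_failures = 0
    self.status = None  # None means no decision yet (still STARTING)

  def maybe_update_failure_count(self, is_healthy):
    """Record one health check result and make every status decision here."""
    if self.status == TASK_FAILED:
      return self.status  # failure is terminal in this sketch

    elapsed = self._clock() - self._start
    in_initial_interval = elapsed < self._initial_interval_secs

    if is_healthy:
      self._successes += 1
      self._consecutive_failures = 0
    else:
      self._consecutive_failures += 1

    if self.status is None:
      if not in_initial_interval:
        # Required successes were not reached within the initial interval.
        self.status = TASK_FAILED
      elif self._successes >= self._min_successes:
        self.status = TASK_RUNNING
    elif not in_initial_interval and self._consecutive_failures > self._max_failures:
      # Already RUNNING, but too many consecutive failures afterwards.
      self.status = TASK_FAILED
    return self.status
```

Because the decision lives in one method, a test can drive it with a fake 
clock and a sequence of check results, without touching threads or timers.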


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149859
---


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> The executor needs to start executing user content in STARTING and transition 
> to RUNNING once the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement health 
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker 
> sends a "TASK_RUNNING" message once the required number of successful health 
> checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED" 
> message is sent if the health checker fails to reach the required number of 
> successful health checks within that period, or if the maximum number of 
> failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a 
> "TASK_RUNNING" message to the scheduler after the thermos runner has started. 
> In this scenario, the vCurrent executor behaves the same as the vPrev 
> executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149859
---




src/main/python/apache/aurora/executor/aurora_executor.py (lines 184 - 185)


Generally preferable to use a comprehension rather than calling map...

for terminal_state in set([v.status for k, v in 
ThermosTaskRunner.EXIT_STATE_MAP.items()]):
  ...

Alternately you could avoid iterating twice by invoking register directly 
from within the comprehension:

[self._register(v.status, self._shutdown) for _, v in 
ThermosTaskRunner.EXIT_STATE_MAP.items()]

You'd have to move the deduping to `_register` though, since we wouldn't be 
creating the intermediate set to strip the duplicate statuses.

And, on further investigation, *none* of the above should be necessary. We 
already have a set of terminal states defined on `ExecutorBase`, which this 
class extends: 
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/executor_base.py#L30-L35
So you should just be able to do:

for terminal_state in self.TERMINAL_STATES:
  self._register(terminal_state, self._shutdown)
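
As a self-contained illustration of the two approaches above, the sketch 
below registers a shutdown callback once per distinct status, first by 
deduplicating the values of an exit state map and then by iterating a 
predefined set of terminal states. `SketchExecutor`, `ExitState`, and the 
contents of `EXIT_STATE_MAP` and `TERMINAL_STATES` are hypothetical stand-ins 
chosen for this example; they are not the actual Aurora definitions.

```python
from collections import namedtuple

# Hypothetical stand-in for the values stored in an exit state map.
ExitState = namedtuple('ExitState', ['status'])

# Hypothetical map: two runner states share the same status on purpose,
# so the need for deduplication is visible.
EXIT_STATE_MAP = {
  'SUCCESS': ExitState('TASK_FINISHED'),
  'FAILED': ExitState('TASK_FAILED'),
  'ERROR': ExitState('TASK_FAILED'),
  'KILLED': ExitState('TASK_KILLED'),
  'LOST': ExitState('TASK_LOST'),
}


class SketchExecutor(object):
  # Hypothetical set of terminal states, playing the role of the set
  # defined on the base class.
  TERMINAL_STATES = frozenset(
      ['TASK_FINISHED', 'TASK_FAILED', 'TASK_KILLED', 'TASK_LOST'])

  def __init__(self):
    self.callbacks = {}

  def _register(self, status, callback):
    self.callbacks[status] = callback

  def _shutdown(self, status):
    return 'shutdown:%s' % status

  def register_from_exit_state_map(self):
    # A set comprehension strips the duplicate statuses before registering.
    for terminal_state in {v.status for v in EXIT_STATE_MAP.values()}:
      self._register(terminal_state, self._shutdown)

  def register_from_terminal_states(self):
    # Simpler: reuse the set that already lists the terminal states.
    for terminal_state in self.TERMINAL_STATES:
      self._register(terminal_state, self._shutdown)
```

Under these assumptions both methods register the same four statuses; the 
second avoids touching the runner's map at all.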



src/main/python/apache/aurora/executor/common/health_checker.py (line 128)


Should this be a call to the `initial_in_progress` function? If so, this 
indicates a lack of test coverage; a bug of this caliber should definitely 
be causing a test failure somewhere ;).


- Joshua Cohen



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/executor/status_manager.py, line 68
> > 
> >
> > This is exactly the problem I alluded to above. What if there is no 
> > entry for a given `status` here?

Yeah, I agree it's safer to make shutdown a default behavior here.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang


> On Sept. 21, 2016, 5:34 p.m., Maxim Khutornenko wrote:
> > api/src/main/thrift/org/apache/aurora/gen/api.thrift, lines 1218-1220
> > 
> >
> > What's the motivation behind moving these constants into the thrift 
> > layer? FWICT, these are only used inside `health_checker.py` and should 
> > stay there.

If we move the `is_health_check_enabled` logic into task_info.py, then 
task_info.py will need the constants defined in health_checker.py, which 
itself depends on task_info.py, so we run into a circular dependency. To avoid 
duplicating the constants, I made them global.


- Kai


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149864
---




src/main/python/apache/aurora/executor/common/health_checker.py (lines 153 - 
155)


There is a typo here. It should be:

if NOT self.health_check_passed:
   log.warning


- Kai Huang





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Maxim Khutornenko


> On Sept. 20, 2016, 12:28 a.m., Kai Huang wrote:
> > src/main/python/apache/aurora/executor/common/status_checker.py, line 104
> > 
> >
> > Use TaskState.Value('TASK_RUNNING') here instead of 
> > mesos_pb2.TASK_RUNNING, because this file also used TaskState in multiple 
> > places.

Not sure I get this. What prevents you from using `mesos_pb2.TASK_RUNNING` here 
and keeping `TaskState` in other places?


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149581
---





Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-21 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149846
---




api/src/main/thrift/org/apache/aurora/gen/api.thrift (lines 1218 - 1220)


What's the motivation behind moving these constants into the thrift layer? 
FWICT, these are only used inside `health_checker.py` and should stay there.



src/main/python/apache/aurora/executor/aurora_executor.py (line 80)


This does not have to be a global field, does it? It looks like 
`_register()` only adds callbacks but nothing ever reads from it.



src/main/python/apache/aurora/executor/aurora_executor.py (lines 184 - 185)


Is there any guarantee that the status_manager will only ever deal with 
`ThermosTaskRunner.EXIT_STATE_MAP` values? As I see it, any unmapped state 
will never call `_shutdown()` at all. Perhaps a safer approach would be an 
explicit pair: `_running()`, called only for the RUNNING state, and 
`_shutdown()`, acting as the default callback for anything else. Curious what 
others think here.
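
A minimal sketch of that idea, combined with the edge-triggered behavior 
listed in the change description, might look like the following. 
`StatusDispatcher` and its method names are hypothetical and exist only for 
this example; statuses are plain strings here rather than protobuf values.

```python
class StatusDispatcher(object):
  """Hypothetical dispatcher: RUNNING gets its own callback and every other
  status falls through to shutdown, so no status can be left unhandled."""

  RUNNING = 'TASK_RUNNING'

  def __init__(self, running_callback, shutdown_callback):
    self._running = running_callback
    self._shutdown = shutdown_callback
    self._last_status = None

  def on_status(self, status):
    """Invoke a callback only when the status changes (edge triggered).

    Returns True if a callback was invoked, False otherwise.
    """
    if status is None or status == self._last_status:
      return False
    self._last_status = status
    if status == self.RUNNING:
      self._running(status)
    else:
      # Default branch: anything that is not RUNNING triggers shutdown.
      self._shutdown(status)
    return True
```

With a default branch there is no lookup that can miss, which is the failure 
mode raised above for unmapped states.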



src/main/python/apache/aurora/executor/common/health_checker.py (line 140)


leftover?



src/main/python/apache/aurora/executor/common/health_checker.py (lines 153 - 
155)


This should be moved into `_maybe_update_failure_count()` so that all 
decision-making logic lives in a single place.



src/main/python/apache/aurora/executor/common/health_checker.py (line 225)


This should be `mesos_pb2.TASK_RUNNING`, right?



src/main/python/apache/aurora/executor/common/health_checker.py (lines 270 - 
274)


This logic needs to be refactored to reduce duplication and reuse what's 
now in `is_health_check_enabled` as much as possible. Ideally, we should have 
a single place where we extract `health_checker` and the like.



src/main/python/apache/aurora/executor/status_manager.py (line 68)


This is exactly the problem I alluded to above. What if there is no entry 
for a given `status` here?


- Maxim Khutornenko



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149588
---


Ship it!




Master (4c4040f) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement health
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker
> sends a "TASK_RUNNING" message once the required number of successful health
> checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED"
> message is sent if the health checker fails to reach the required number of
> successful health checks within that period, or if the maximum number of
> failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a
> "TASK_RUNNING" message to the scheduler after the thermos runner has started.
> In this scenario, the vCurrent executor behaves the same as the vPrev
> executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-19 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review149581
---




src/main/python/apache/aurora/executor/common/status_checker.py (line 104)


Use TaskState.Value('TASK_RUNNING') here instead of mesos_pb2.TASK_RUNNING,
because this file already uses TaskState in multiple places.


- Kai Huang


On Sept. 20, 2016, 12:25 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 20, 2016, 12:25 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement health
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker
> sends a "TASK_RUNNING" message once the required number of successful health
> checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED"
> message is sent if the health checker fails to reach the required number of
> successful health checks within that period, or if the maximum number of
> failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a
> "TASK_RUNNING" message to the scheduler after the thermos runner has started.
> In this scenario, the vCurrent executor behaves the same as the vPrev
> executor.
> 
> [Change List]
> The current change set includes:
> 1. Removed the status memoization in ChainedStatusChecker.
> 2. Modified the StatusManager to be edge triggered.
> 3. Changed the Aurora Executor callback function.
> 4. Modified the Health Checker and redefined the meaning of
> initial_interval_secs.
> 
> 
> Diffs
> -
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> c5765b70501c101f0535b4eed94e9948c36808f9 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> ce5ef680f01831cd89fced8969ae3246c7f60cfd 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/aurora/executor/common/status_checker.py 
> 795dae2d6b661fc528d952c2315196d94127961f 
>   src/main/python/apache/aurora/executor/common/task_info.py 
> 4ef49e30eeb942886d02c1fb448055264f9aa874 
>   src/main/python/apache/aurora/executor/status_manager.py 
> 228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/python/apache/aurora/executor/common/test_status_checker.py 
> 5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
>   src/test/python/apache/aurora/executor/common/test_task_info.py 
> 0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
>   src/test/python/apache/aurora/executor/test_status_manager.py 
> ce4679ba1aa7b42cf0115c943d84663030182d23 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> 0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 
> 
> Diff: https://reviews.apache.org/r/51876/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> 
> Thanks,
> 
> Kai Huang
> 
>



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-19 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 20, 2016, 12:25 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Fixed one flaky test for Thermos Executor. Added unit tests for the modified 
components in the notification chain.


Summary (updated)
-

Modify executor state transition logic to rely on health checks (if enabled)


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description (updated)
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
The executor needs to start executing user content in STARTING and transition
to RUNNING once the required number of successful health checks is reached.

This review contains a series of executor changes that implement health
check driven updates. It gives more context on the design of this feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health checking is enabled on the vCurrent executor, the health checker
sends a "TASK_RUNNING" message once the required number of successful health
checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED"
message is sent if the health checker fails to reach the required number of
successful health checks within that period, or if the maximum number of
failed health checks is reached after initial_interval_secs.

If health checking is disabled on the vCurrent executor, it sends a
"TASK_RUNNING" message to the scheduler after the thermos runner has started.
In this scenario, the vCurrent executor behaves the same as the vPrev
executor.
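
The transition rules above can be sketched roughly as follows. This is a
minimal illustration, not the code in this review: the class and helper names,
the fields required_successes and max_failures, and the choice to count
consecutive results are all assumptions made for the example. Only
initial_interval_secs and the TASK_RUNNING / TASK_FAILED messages come from
the description.

```python
from dataclasses import dataclass
from typing import Optional

TASK_RUNNING = "TASK_RUNNING"
TASK_FAILED = "TASK_FAILED"


@dataclass
class HealthCheckPolicy:
    """Hypothetical thresholds; only initial_interval_secs is named in the review."""
    initial_interval_secs: float
    required_successes: int  # assumed name for the "required number" of passing checks
    max_failures: int        # assumed name for the failure limit; assumed to be >= 1


class HealthCheckTransition:
    """Decides which status update, if any, a single health check result triggers."""

    def __init__(self, policy: HealthCheckPolicy) -> None:
        self.policy = policy
        self.successes = 0  # consecutive passing checks (consecutive is an assumption)
        self.failures = 0   # consecutive failing checks
        self.running = False

    def record(self, healthy: bool, elapsed_secs: float) -> Optional[str]:
        """Record one check result observed elapsed_secs after the task started."""
        if healthy:
            self.successes += 1
            self.failures = 0
        else:
            self.failures += 1
            self.successes = 0

        if not self.running:
            if elapsed_secs > self.policy.initial_interval_secs:
                # The required successes were not reached within the interval.
                return TASK_FAILED
            if self.successes >= self.policy.required_successes:
                self.running = True
                return TASK_RUNNING
            return None

        # Already RUNNING: only reaching the failure limit changes the state.
        if self.failures >= self.policy.max_failures:
            return TASK_FAILED
        return None


def status_after_runner_start(health_check_enabled: bool) -> Optional[str]:
    """With health checks disabled, RUNNING is reported once the runner has started."""
    return None if health_check_enabled else TASK_RUNNING
```

In this sketch, failures seen before the task is RUNNING do not fail it on
their own; only running out of the initial interval does.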

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of initial_interval_secs.


Diffs (updated)
-

  api/src/main/thrift/org/apache/aurora/gen/api.thrift 
c5765b70501c101f0535b4eed94e9948c36808f9 
  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/common/task_info.py 
4ef49e30eeb942886d02c1fb448055264f9aa874 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/common/test_task_info.py 
0d9cc5cb340d697a887d8e001f23c948f4fa70c7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang



Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-13 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/#review148834
---



Master (5069f93) is red with this patch.
  ./build-support/jenkins/build.sh

  INFO] Executor []: Updating sleep2-001 => FINISHED
 I0914 01:16:05.345735 969 executor_base.py:45] Executor []: Updating sleep2-001 => FINISHED
  INFO] Executor []:Reason: Task finished.
 I0914 01:16:05.347143 969 executor_base.py:45] Executor []:Reason: Task finished.
  generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
 
  1 failed, 708 passed, 6 skipped, 1 warnings in 236.24 seconds
 
FAILURE


01:16:25 04:31   [complete]
   FAILURE


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 14, 2016, 12:47 a.m., Kai Huang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51876/
> ---
> 
> (Updated Sept. 14, 2016, 12:47 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
> https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Modify executor state transition logic to rely on health checks (if enabled).
> 
> [Summary]
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.
> 
> This review contains a series of executor changes that implement health
> check driven updates. It gives more context on the design of this feature.
> 
> [Background]
> Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
> and the design doc: 
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
>  for more details and background.
> 
> [Description]
> If health checking is enabled on the vCurrent executor, the health checker
> sends a "TASK_RUNNING" message once the required number of successful health
> checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED"
> message is sent if the health checker fails to reach the required number of
> successful health checks within that period, or if the maximum number of
> failed health checks is reached after initial_interval_secs.
> 
> If health checking is disabled on the vCurrent executor, it sends a
> "TASK_RUNNING" message to the scheduler after the thermos runner has started.
> In this scenario, the vCurrent executor behaves the same as the vPrev
> executor.
> 
> [Change List]
> The current 

Re: Review Request 51876: Modify executor state transition logic to rely on health checks (if enabled)

2016-09-13 Thread Kai Huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51876/
---

(Updated Sept. 14, 2016, 12:40 a.m.)


Review request for Aurora, Joshua Cohen, Maxim Khutornenko, and Zameer Manji.


Changes
---

Modify the change list item.


Bugs: AURORA-1225
https://issues.apache.org/jira/browse/AURORA-1225


Repository: aurora


Description (updated)
---

Modify executor state transition logic to rely on health checks (if enabled).

[Summary]
The executor needs to start executing user content in STARTING and transition
to RUNNING once the required number of successful health checks is reached.

This review contains a series of executor changes that implement health
check driven updates. It gives more complete context on the design of this
feature.

[Background]
Please see this epic: https://issues.apache.org/jira/browse/AURORA-1225
and the design doc: 
https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
 for more details and background.

[Description]
If health checking is enabled on the vCurrent executor, the health checker
sends a "TASK_RUNNING" message once the required number of successful health
checks is reached within initial_interval_secs. Conversely, a "TASK_FAILED"
message is sent if it fails to satisfy the required number of health checks
within initial_interval_secs, or if the maximum number of failed health checks
is reached after initial_interval_secs.

If health checking is disabled on the vCurrent executor, it sends a
"TASK_RUNNING" message to the scheduler, so the behavior is the same as the
vPrev executor.

[Change List]
The current change set includes:
1. Removed the status memoization in ChainedStatusChecker.
2. Modified the StatusManager to be edge triggered.
3. Changed the Aurora Executor callback function.
4. Modified the Health Checker and redefined the meaning of initial_interval_secs.
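
To illustrate what "edge triggered" means in item 2, here is a minimal sketch
under stated assumptions. EdgeTriggeredWatcher, read_status, and on_change are
hypothetical names and do not reflect the actual StatusManager interface. The
idea shown is that the callback fires only when the polled status changes, not
on every poll where a status happens to be present, and that the status is
read fresh on each poll rather than remembered by the reader.

```python
from typing import Callable, List, Optional


class EdgeTriggeredWatcher:
    """Invokes a callback only when the polled status changes (hypothetical class)."""

    def __init__(self, read_status: Callable[[], Optional[str]],
                 on_change: Callable[[str], None]) -> None:
        self._read_status = read_status
        self._on_change = on_change
        self._last: Optional[str] = None

    def poll_once(self) -> None:
        # Read the status fresh on every poll; the reader caches nothing.
        current = self._read_status()
        # Fire on the edge (a change), not on the level (a status merely being set).
        if current is not None and current != self._last:
            self._last = current
            self._on_change(current)


# Usage: a scripted status source standing in for a real status checker.
polled = iter([None, "TASK_RUNNING", "TASK_RUNNING", "TASK_FAILED"])
seen: List[str] = []
watcher = EdgeTriggeredWatcher(lambda: next(polled), seen.append)
for _ in range(4):
    watcher.poll_once()
print(seen)  # ['TASK_RUNNING', 'TASK_FAILED']
```

The repeated "TASK_RUNNING" reading produces no second callback, which is what
lets a later "TASK_FAILED" be reported as a distinct transition.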

[TODOs]
I have fixed all tests broken by my changes. However, more tests need to be
added to accommodate the executor change. I will send a follow-up review
update once I cover more edge cases. Any feedback on the implementation is
highly appreciated.


Diffs
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
ce5ef680f01831cd89fced8969ae3246c7f60cfd 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/aurora/executor/common/status_checker.py 
795dae2d6b661fc528d952c2315196d94127961f 
  src/main/python/apache/aurora/executor/status_manager.py 
228a99a05f339e21cd7e769a42b9b2276e7bc3fc 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/python/apache/aurora/executor/common/test_status_checker.py 
5be1981c8c8e88258456adb21aa3ca7c0aa472a7 
  src/test/python/apache/aurora/executor/test_status_manager.py 
ce4679ba1aa7b42cf0115c943d84663030182d23 
  src/test/python/apache/aurora/executor/test_thermos_executor.py 
0bfe9e931f873c9f804f2ba4012e050e1f9fd24e 

Diff: https://reviews.apache.org/r/51876/diff/


Testing
---

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::


Thanks,

Kai Huang