Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-18 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72980
---


For this change to be useful, we also have to think about the meaning of 
`initial_interval_secs`. In its curent form, health checks only start when the 
initial delay has passed. Commonly this delay has to be set very high in order 
to guarantee that a task will come up even in a worst case scenario (e.g., 
server where I pull my binary from is slow today). With your change however, no 
task would be considered running until this worst case time window has passed.

A potential solution would be to change the meaning of `initial_interval_secs` 
to always send health checks but to ignore any errors.

- Stephan Erb


On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 18, 2015, 4:32 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Bugs: AURORA-894
 https://issues.apache.org/jira/browse/AURORA-894
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This is the first step in changing TASK_RUNNING to mean that the application 
 is alive and responding to health checks (if the task is configured to 
 support health checks).  This review is just to get feedback, I can't do this 
 review in parts because the scheduler must be changed in lockstep with the 
 executor, or everything will break.
 
 I don't know if this is the right approach, could you give me some high level 
 advice?  I'm also not sure who to add to this review.
 
 Here is the high level description that we came up with:
 
 http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/aurora_executor.py 
 9c0282392dbb9cca308baf47adc1750c1f5cacc6 
   src/main/python/apache/aurora/executor/common/announcer.py 
 dda76f018f472d7d8228459eb89f4c5daf9df26d 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 60676ba0fbd8a218fe4309f07de28e2c66d54530 
   src/main/python/apache/aurora/executor/common/resource_manager.py 
 08e02e41b581f275f070228bb23c4cf2a0489f9a 
   src/main/python/apache/aurora/executor/common/status_checker.py 
 624921d68199df098ea51ee8a10815403bf58984 
   src/test/python/apache/aurora/executor/common/test_announcer.py 
 6b782778e52394de3744b43003226dac3f65169e 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 def249c2509a28f7145380f250f79202b653dc83 
   
 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
  8f288f6115ab52265dfada3f41d81271c55a 
 
 Diff: https://reviews.apache.org/r/31104/diff/
 
 
 Testing
 ---
 
 This hangs after I call is_health_checks_enabled, and I don't know why.  My 
 suspicion is that I'm throwing an exception and cratering the task executor, 
 but I don't know how to tell.  How do I get it to print?  I'm running it with:
 
 ./pants test src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 Moses Nakamura
 




Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-18 Thread Moses Nakamura


 On Feb. 18, 2015, 7 p.m., Stephan Erb wrote:
  For this change to be useful, we also have to think about the meaning of 
  `initial_interval_secs`. In its curent form, health checks only start when 
  the initial delay has passed. Commonly this delay has to be set very high 
  in order to guarantee that a task will come up even in a worst case 
  scenario (e.g., server where I pull my binary from is slow today). With 
  your change however, no task would be considered running until this worst 
  case time window has passed.
  
  A potential solution would be to change the meaning of 
  `initial_interval_secs` to always send health checks but to ignore any 
  errors.

+1, that's a good idea.


- Moses


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72980
---


On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 18, 2015, 4:32 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Bugs: AURORA-894
 https://issues.apache.org/jira/browse/AURORA-894
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This is the first step in changing TASK_RUNNING to mean that the application 
 is alive and responding to health checks (if the task is configured to 
 support health checks).  This review is just to get feedback, I can't do this 
 review in parts because the scheduler must be changed in lockstep with the 
 executor, or everything will break.
 
 I don't know if this is the right approach, could you give me some high level 
 advice?  I'm also not sure who to add to this review.
 
 Here is the high level description that we came up with:
 
 http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/aurora_executor.py 
 9c0282392dbb9cca308baf47adc1750c1f5cacc6 
   src/main/python/apache/aurora/executor/common/announcer.py 
 dda76f018f472d7d8228459eb89f4c5daf9df26d 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 60676ba0fbd8a218fe4309f07de28e2c66d54530 
   src/main/python/apache/aurora/executor/common/resource_manager.py 
 08e02e41b581f275f070228bb23c4cf2a0489f9a 
   src/main/python/apache/aurora/executor/common/status_checker.py 
 624921d68199df098ea51ee8a10815403bf58984 
   src/test/python/apache/aurora/executor/common/test_announcer.py 
 6b782778e52394de3744b43003226dac3f65169e 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 def249c2509a28f7145380f250f79202b653dc83 
   
 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
  8f288f6115ab52265dfada3f41d81271c55a 
 
 Diff: https://reviews.apache.org/r/31104/diff/
 
 
 Testing
 ---
 
 This hangs after I call is_health_checks_enabled, and I don't know why.  My 
 suspicion is that I'm throwing an exception and cratering the task executor, 
 but I don't know how to tell.  How do I get it to print?  I'm running it with:
 
 ./pants test src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 Moses Nakamura
 




Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-17 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72868
---


Will this change break update configs because some time values in the 
UpdateConfig are timeouts until a task enters the RUNNING state?

- Zameer Manji


On Feb. 16, 2015, 10:22 p.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 16, 2015, 10:22 p.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This is the first step in changing TASK_RUNNING to mean that the application 
 is alive and responding to health checks (if the task is configured to 
 support health checks).  This review is just to get feedback, I can't do this 
 review in parts because the scheduler must be changed in lockstep with the 
 executor, or everything will break.
 
 I don't know if this is the right approach, could you give me some high level 
 advice?  I'm also not sure who to add to this review.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/aurora_executor.py 
 9c0282392dbb9cca308baf47adc1750c1f5cacc6 
   src/main/python/apache/aurora/executor/common/announcer.py 
 dda76f018f472d7d8228459eb89f4c5daf9df26d 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 60676ba0fbd8a218fe4309f07de28e2c66d54530 
   src/main/python/apache/aurora/executor/common/resource_manager.py 
 08e02e41b581f275f070228bb23c4cf2a0489f9a 
   src/main/python/apache/aurora/executor/common/status_checker.py 
 624921d68199df098ea51ee8a10815403bf58984 
   src/test/python/apache/aurora/executor/common/test_announcer.py 
 6b782778e52394de3744b43003226dac3f65169e 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 def249c2509a28f7145380f250f79202b653dc83 
   
 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
  8f288f6115ab52265dfada3f41d81271c55a 
 
 Diff: https://reviews.apache.org/r/31104/diff/
 
 
 Testing
 ---
 
 This hangs after I call is_health_checks_enabled, and I don't know why.  My 
 suspicion is that I'm throwing an exception and cratering the task executor, 
 but I don't know how to tell.  How do I get it to print?  I'm running it with:
 
 ./pants test src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 Moses Nakamura
 




Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-17 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72865
---


Master (4b43305) is red with this patch.
  ./build-support/jenkins/build.sh

 src.test.python.apache.aurora.client.cli.sla   
 .   SUCCESS
 src.test.python.apache.aurora.client.cli.supdate   
 .   SUCCESS
 src.test.python.apache.aurora.client.cli.task  
 .   SUCCESS
 src.test.python.apache.aurora.client.cli.update
 .   SUCCESS
 src.test.python.apache.aurora.client.cli.version   
 .   SUCCESS
 src.test.python.apache.aurora.client.config
 .   SUCCESS
 src.test.python.apache.aurora.client.factory   
 .   SUCCESS
 src.test.python.apache.aurora.client.hooks.hooked_api  
 .   SUCCESS
 src.test.python.apache.aurora.client.hooks.non_hooked_api  
 .   SUCCESS
 src.test.python.apache.aurora.common.test_aurora_job_key   
 .   SUCCESS
 src.test.python.apache.aurora.common.test_cluster  
 .   SUCCESS
 src.test.python.apache.aurora.common.test_cluster_option   
 .   SUCCESS
 src.test.python.apache.aurora.common.test_clusters 
 .   SUCCESS
 src.test.python.apache.aurora.common.test_http_signaler
 .   SUCCESS
 src.test.python.apache.aurora.common.test_pex_version  
 .   SUCCESS
 src.test.python.apache.aurora.common.test_shellify 
 .   SUCCESS
 src.test.python.apache.aurora.common.test_transport
 .   SUCCESS
 src.test.python.apache.aurora.config.test_base 
 .   SUCCESS
 
src.test.python.apache.aurora.config.test_constraint_parsing
.   SUCCESS
 src.test.python.apache.aurora.config.test_loader   
 .   SUCCESS
 src.test.python.apache.aurora.config.test_thrift   
 .   SUCCESS
 src.test.python.apache.aurora.executor.common.announcer
 .   SUCCESS
 
src.test.python.apache.aurora.executor.common.directory_sandbox 
.   SUCCESS
 
src.test.python.apache.aurora.executor.common.executor_detector 
.   SUCCESS
 
src.test.python.apache.aurora.executor.common.executor_timeout  
.   SUCCESS
 
src.test.python.apache.aurora.executor.common.health_checker
.   FAILURE
 
src.test.python.apache.aurora.executor.common.path_detector 
.   SUCCESS
 src.test.python.apache.aurora.executor.common.task_info
 .   SUCCESS
 src.test.python.apache.aurora.executor.executor_base   
 .   SUCCESS
 src.test.python.apache.aurora.executor.executor_vars   
 .   SUCCESS
 src.test.python.apache.aurora.executor.status_manager  
 .   SUCCESS
 src.test.python.apache.aurora.executor.thermos_task_runner 
 .   SUCCESS
 src.test.python.apache.thermos.common.test_pathspec
 .   SUCCESS
 
src.test.python.apache.thermos.core.test_runner_integration 
.   SUCCESS
 src.test.python.apache.thermos.monitoring.test_disk
 .   SUCCESS
 
FAILURE


   FAILURE


I will refresh this build result if you post a review containing @ReviewBot 
retry

- Aurora ReviewBot


On Feb. 17, 2015, 6:22 a.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 17, 2015, 6:22 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Repository: 

Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-17 Thread Bill Farner


 On Feb. 18, 2015, 1:40 a.m., Zameer Manji wrote:
  Will this change break update configs because some time values in the 
  UpdateConfig are timeouts until a task enters the RUNNING state?

It very likely will, communication will be imperative with this change.  Moses 
- can you link this against the ticket and/or discussion thread with background?


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72868
---


On Feb. 17, 2015, 6:22 a.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 17, 2015, 6:22 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This is the first step in changing TASK_RUNNING to mean that the application 
 is alive and responding to health checks (if the task is configured to 
 support health checks).  This review is just to get feedback, I can't do this 
 review in parts because the scheduler must be changed in lockstep with the 
 executor, or everything will break.
 
 I don't know if this is the right approach, could you give me some high level 
 advice?  I'm also not sure who to add to this review.
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/aurora_executor.py 
 9c0282392dbb9cca308baf47adc1750c1f5cacc6 
   src/main/python/apache/aurora/executor/common/announcer.py 
 dda76f018f472d7d8228459eb89f4c5daf9df26d 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 60676ba0fbd8a218fe4309f07de28e2c66d54530 
   src/main/python/apache/aurora/executor/common/resource_manager.py 
 08e02e41b581f275f070228bb23c4cf2a0489f9a 
   src/main/python/apache/aurora/executor/common/status_checker.py 
 624921d68199df098ea51ee8a10815403bf58984 
   src/test/python/apache/aurora/executor/common/test_announcer.py 
 6b782778e52394de3744b43003226dac3f65169e 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 def249c2509a28f7145380f250f79202b653dc83 
   
 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
  8f288f6115ab52265dfada3f41d81271c55a 
 
 Diff: https://reviews.apache.org/r/31104/diff/
 
 
 Testing
 ---
 
 This hangs after I call is_health_checks_enabled, and I don't know why.  My 
 suspicion is that I'm throwing an exception and cratering the task executor, 
 but I don't know how to tell.  How do I get it to print?  I'm running it with:
 
 ./pants test src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 Moses Nakamura
 




Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-17 Thread Moses Nakamura

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/
---

(Updated Feb. 18, 2015, 4:32 a.m.)


Review request for Aurora and Brian Wickman.


Changes
---

+bug, +mailing list link


Bugs: AURORA-894
https://issues.apache.org/jira/browse/AURORA-894


Repository: aurora


Description (updated)
---

This is the first step in changing TASK_RUNNING to mean that the application is 
alive and responding to health checks (if the task is configured to support 
health checks).  This review is just to get feedback, I can't do this review in 
parts because the scheduler must be changed in lockstep with the executor, or 
everything will break.

I don't know if this is the right approach, could you give me some high level 
advice?  I'm also not sure who to add to this review.

Here is the high level description that we came up with:

http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E


Diffs
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
9c0282392dbb9cca308baf47adc1750c1f5cacc6 
  src/main/python/apache/aurora/executor/common/announcer.py 
dda76f018f472d7d8228459eb89f4c5daf9df26d 
  src/main/python/apache/aurora/executor/common/health_checker.py 
60676ba0fbd8a218fe4309f07de28e2c66d54530 
  src/main/python/apache/aurora/executor/common/resource_manager.py 
08e02e41b581f275f070228bb23c4cf2a0489f9a 
  src/main/python/apache/aurora/executor/common/status_checker.py 
624921d68199df098ea51ee8a10815403bf58984 
  src/test/python/apache/aurora/executor/common/test_announcer.py 
6b782778e52394de3744b43003226dac3f65169e 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
def249c2509a28f7145380f250f79202b653dc83 
  
src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
 8f288f6115ab52265dfada3f41d81271c55a 

Diff: https://reviews.apache.org/r/31104/diff/


Testing
---

This hangs after I call is_health_checks_enabled, and I don't know why.  My 
suspicion is that I'm throwing an exception and cratering the task executor, 
but I don't know how to tell.  How do I get it to print?  I'm running it with:

./pants test src/test/python/apache/aurora/executor::


Thanks,

Moses Nakamura



Re: Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-17 Thread Moses Nakamura


 On Feb. 18, 2015, 1:40 a.m., Zameer Manji wrote:
  Will this change break update configs because some time values in the 
  UpdateConfig are timeouts until a task enters the RUNNING state?
 
 Bill Farner wrote:
 It very likely will, communication will be imperative with this change.  
 Moses - can you link this against the ticket and/or discussion thread with 
 background?

If I understand your question correctly, yes, that is the intent.  If you have 
health checks, passing a health check will short-circuit the OK, we can move 
on to the next batch step.  If you don't have health checks, it should 
continue to operate as it has already done, which is to say that if it will be 
marked successful unless it gets marked as failed by the executor (I think for 
non-health checking services this means the process died).


- Moses


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/#review72868
---


On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31104/
 ---
 
 (Updated Feb. 18, 2015, 4:32 a.m.)
 
 
 Review request for Aurora and Brian Wickman.
 
 
 Bugs: AURORA-894
 https://issues.apache.org/jira/browse/AURORA-894
 
 
 Repository: aurora
 
 
 Description
 ---
 
 This is the first step in changing TASK_RUNNING to mean that the application 
 is alive and responding to health checks (if the task is configured to 
 support health checks).  This review is just to get feedback, I can't do this 
 review in parts because the scheduler must be changed in lockstep with the 
 executor, or everything will break.
 
 I don't know if this is the right approach, could you give me some high level 
 advice?  I'm also not sure who to add to this review.
 
 Here is the high level description that we came up with:
 
 http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/aurora_executor.py 
 9c0282392dbb9cca308baf47adc1750c1f5cacc6 
   src/main/python/apache/aurora/executor/common/announcer.py 
 dda76f018f472d7d8228459eb89f4c5daf9df26d 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 60676ba0fbd8a218fe4309f07de28e2c66d54530 
   src/main/python/apache/aurora/executor/common/resource_manager.py 
 08e02e41b581f275f070228bb23c4cf2a0489f9a 
   src/main/python/apache/aurora/executor/common/status_checker.py 
 624921d68199df098ea51ee8a10815403bf58984 
   src/test/python/apache/aurora/executor/common/test_announcer.py 
 6b782778e52394de3744b43003226dac3f65169e 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 def249c2509a28f7145380f250f79202b653dc83 
   
 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
  8f288f6115ab52265dfada3f41d81271c55a 
 
 Diff: https://reviews.apache.org/r/31104/diff/
 
 
 Testing
 ---
 
 This hangs after I call is_health_checks_enabled, and I don't know why.  My 
 suspicion is that I'm throwing an exception and cratering the task executor, 
 but I don't know how to tell.  How do I get it to print?  I'm running it with:
 
 ./pants test src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 Moses Nakamura
 




Review Request 31104: task-executor: TASK_RUNNING after first health check

2015-02-16 Thread Moses Nakamura

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31104/
---

Review request for Aurora and Brian Wickman.


Repository: aurora


Description
---

This is the first step in changing TASK_RUNNING to mean that the application is 
alive and responding to health checks (if the task is configured to support 
health checks).  This review is just to get feedback, I can't do this review in 
parts because the scheduler must be changed in lockstep with the executor, or 
everything will break.

I don't know if this is the right approach, could you give me some high level 
advice?  I'm also not sure who to add to this review.


Diffs
-

  src/main/python/apache/aurora/executor/aurora_executor.py 
9c0282392dbb9cca308baf47adc1750c1f5cacc6 
  src/main/python/apache/aurora/executor/common/announcer.py 
dda76f018f472d7d8228459eb89f4c5daf9df26d 
  src/main/python/apache/aurora/executor/common/health_checker.py 
60676ba0fbd8a218fe4309f07de28e2c66d54530 
  src/main/python/apache/aurora/executor/common/resource_manager.py 
08e02e41b581f275f070228bb23c4cf2a0489f9a 
  src/main/python/apache/aurora/executor/common/status_checker.py 
624921d68199df098ea51ee8a10815403bf58984 
  src/test/python/apache/aurora/executor/common/test_announcer.py 
6b782778e52394de3744b43003226dac3f65169e 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
def249c2509a28f7145380f250f79202b653dc83 
  
src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
 8f288f6115ab52265dfada3f41d81271c55a 

Diff: https://reviews.apache.org/r/31104/diff/


Testing
---

This hangs after I call is_health_checks_enabled, and I don't know why.  My 
suspicion is that I'm throwing an exception and cratering the task executor, 
but I don't know how to tell.  How do I get it to print?  I'm running it with:

./pants test src/test/python/apache/aurora/executor::


Thanks,

Moses Nakamura