Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72980 --- For this change to be useful, we also have to think about the meaning of `initial_interval_secs`. In its curent form, health checks only start when the initial delay has passed. Commonly this delay has to be set very high in order to guarantee that a task will come up even in a worst case scenario (e.g., server where I pull my binary from is slow today). With your change however, no task would be considered running until this worst case time window has passed. A potential solution would be to change the meaning of `initial_interval_secs` to always send health checks but to ignore any errors. - Stephan Erb On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 18, 2015, 4:32 a.m.) Review request for Aurora and Brian Wickman. Bugs: AURORA-894 https://issues.apache.org/jira/browse/AURORA-894 Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Here is the high level description that we came up with: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
On Feb. 18, 2015, 7 p.m., Stephan Erb wrote: For this change to be useful, we also have to think about the meaning of `initial_interval_secs`. In its curent form, health checks only start when the initial delay has passed. Commonly this delay has to be set very high in order to guarantee that a task will come up even in a worst case scenario (e.g., server where I pull my binary from is slow today). With your change however, no task would be considered running until this worst case time window has passed. A potential solution would be to change the meaning of `initial_interval_secs` to always send health checks but to ignore any errors. +1, that's a good idea. - Moses --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72980 --- On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 18, 2015, 4:32 a.m.) Review request for Aurora and Brian Wickman. Bugs: AURORA-894 https://issues.apache.org/jira/browse/AURORA-894 Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Here is the high level description that we came up with: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72868 --- Will this change break update configs because some time values in the UpdateConfig are timeouts until a task enters the RUNNING state? - Zameer Manji On Feb. 16, 2015, 10:22 p.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 16, 2015, 10:22 p.m.) Review request for Aurora and Brian Wickman. Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72865 --- Master (4b43305) is red with this patch. ./build-support/jenkins/build.sh src.test.python.apache.aurora.client.cli.sla . SUCCESS src.test.python.apache.aurora.client.cli.supdate . SUCCESS src.test.python.apache.aurora.client.cli.task . SUCCESS src.test.python.apache.aurora.client.cli.update . SUCCESS src.test.python.apache.aurora.client.cli.version . SUCCESS src.test.python.apache.aurora.client.config . SUCCESS src.test.python.apache.aurora.client.factory . SUCCESS src.test.python.apache.aurora.client.hooks.hooked_api . SUCCESS src.test.python.apache.aurora.client.hooks.non_hooked_api . SUCCESS src.test.python.apache.aurora.common.test_aurora_job_key . SUCCESS src.test.python.apache.aurora.common.test_cluster . SUCCESS src.test.python.apache.aurora.common.test_cluster_option . SUCCESS src.test.python.apache.aurora.common.test_clusters . SUCCESS src.test.python.apache.aurora.common.test_http_signaler . SUCCESS src.test.python.apache.aurora.common.test_pex_version . SUCCESS src.test.python.apache.aurora.common.test_shellify . SUCCESS src.test.python.apache.aurora.common.test_transport . SUCCESS src.test.python.apache.aurora.config.test_base . SUCCESS src.test.python.apache.aurora.config.test_constraint_parsing . SUCCESS src.test.python.apache.aurora.config.test_loader . SUCCESS src.test.python.apache.aurora.config.test_thrift . SUCCESS src.test.python.apache.aurora.executor.common.announcer . SUCCESS src.test.python.apache.aurora.executor.common.directory_sandbox . SUCCESS src.test.python.apache.aurora.executor.common.executor_detector . SUCCESS src.test.python.apache.aurora.executor.common.executor_timeout . SUCCESS src.test.python.apache.aurora.executor.common.health_checker . FAILURE src.test.python.apache.aurora.executor.common.path_detector . SUCCESS src.test.python.apache.aurora.executor.common.task_info . SUCCESS src.test.python.apache.aurora.executor.executor_base . SUCCESS src.test.python.apache.aurora.executor.executor_vars . SUCCESS src.test.python.apache.aurora.executor.status_manager . SUCCESS src.test.python.apache.aurora.executor.thermos_task_runner . SUCCESS src.test.python.apache.thermos.common.test_pathspec . SUCCESS src.test.python.apache.thermos.core.test_runner_integration . SUCCESS src.test.python.apache.thermos.monitoring.test_disk . SUCCESS FAILURE [31m FAILURE[0m I will refresh this build result if you post a review containing @ReviewBot retry - Aurora ReviewBot On Feb. 17, 2015, 6:22 a.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 17, 2015, 6:22 a.m.) Review request for Aurora and Brian Wickman. Repository:
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
On Feb. 18, 2015, 1:40 a.m., Zameer Manji wrote: Will this change break update configs because some time values in the UpdateConfig are timeouts until a task enters the RUNNING state? It very likely will, communication will be imperative with this change. Moses - can you link this against the ticket and/or discussion thread with background? - Bill --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72868 --- On Feb. 17, 2015, 6:22 a.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 17, 2015, 6:22 a.m.) Review request for Aurora and Brian Wickman. Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 18, 2015, 4:32 a.m.) Review request for Aurora and Brian Wickman. Changes --- +bug, +mailing list link Bugs: AURORA-894 https://issues.apache.org/jira/browse/AURORA-894 Repository: aurora Description (updated) --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Here is the high level description that we came up with: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Re: Review Request 31104: task-executor: TASK_RUNNING after first health check
On Feb. 18, 2015, 1:40 a.m., Zameer Manji wrote: Will this change break update configs because some time values in the UpdateConfig are timeouts until a task enters the RUNNING state? Bill Farner wrote: It very likely will, communication will be imperative with this change. Moses - can you link this against the ticket and/or discussion thread with background? If I understand your question correctly, yes, that is the intent. If you have health checks, passing a health check will short-circuit the OK, we can move on to the next batch step. If you don't have health checks, it should continue to operate as it has already done, which is to say that if it will be marked successful unless it gets marked as failed by the executor (I think for non-health checking services this means the process died). - Moses --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/#review72868 --- On Feb. 18, 2015, 4:32 a.m., Moses Nakamura wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- (Updated Feb. 18, 2015, 4:32 a.m.) Review request for Aurora and Brian Wickman. Bugs: AURORA-894 https://issues.apache.org/jira/browse/AURORA-894 Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Here is the high level description that we came up with: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/%3CCAOTkfX4KTUpMVcjeFf5%3DvvGXb91to5baNSzvyiwtk-sTddxGXQ%40mail.gmail.com%3E Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura
Review Request 31104: task-executor: TASK_RUNNING after first health check
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31104/ --- Review request for Aurora and Brian Wickman. Repository: aurora Description --- This is the first step in changing TASK_RUNNING to mean that the application is alive and responding to health checks (if the task is configured to support health checks). This review is just to get feedback, I can't do this review in parts because the scheduler must be changed in lockstep with the executor, or everything will break. I don't know if this is the right approach, could you give me some high level advice? I'm also not sure who to add to this review. Diffs - src/main/python/apache/aurora/executor/aurora_executor.py 9c0282392dbb9cca308baf47adc1750c1f5cacc6 src/main/python/apache/aurora/executor/common/announcer.py dda76f018f472d7d8228459eb89f4c5daf9df26d src/main/python/apache/aurora/executor/common/health_checker.py 60676ba0fbd8a218fe4309f07de28e2c66d54530 src/main/python/apache/aurora/executor/common/resource_manager.py 08e02e41b581f275f070228bb23c4cf2a0489f9a src/main/python/apache/aurora/executor/common/status_checker.py 624921d68199df098ea51ee8a10815403bf58984 src/test/python/apache/aurora/executor/common/test_announcer.py 6b782778e52394de3744b43003226dac3f65169e src/test/python/apache/aurora/executor/common/test_health_checker.py def249c2509a28f7145380f250f79202b653dc83 src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py 8f288f6115ab52265dfada3f41d81271c55a Diff: https://reviews.apache.org/r/31104/diff/ Testing --- This hangs after I call is_health_checks_enabled, and I don't know why. My suspicion is that I'm throwing an exception and cratering the task executor, but I don't know how to tell. How do I get it to print? I'm running it with: ./pants test src/test/python/apache/aurora/executor:: Thanks, Moses Nakamura