FYI: some of the end-to-end tests have become very flaky, probably due to this commit. I am not sure whether others have seen this behavior.
On Oct 12, 2016 12:51 AM, "David McLaughlin (JIRA)" <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567954#comment-15567954 ]
>
> David McLaughlin edited comment on AURORA-1791 at 10/12/16 7:50 AM:
> --------------------------------------------------------------------
>
> Given the lack of test coverage I've found just looking at a single
> function, I would seriously recommend we roll back the commit (or will it
> be commits?) rather than rush a patch in order to fix master. Any
> objections? cc/ [~zmanji] and [~joshua.cohen]]
>
>
> was (Author: davmclau):
> Given the lack of test coverage I've found just looking at a single
> function, I would seriously recommend we roll back the commit (or will it
> be commits?) rather than rush a patch in order to fix master. Any
> objections? cc/ [~zmanji] and [~jcohen]
>
>
> > Commit ca683 is not backwards compatible.
> > -----------------------------------------
> >
> >                 Key: AURORA-1791
> >                 URL: https://issues.apache.org/jira/browse/AURORA-1791
> >             Project: Aurora
> >          Issue Type: Bug
> >            Reporter: Zameer Manji
> >            Assignee: Kai Huang
> >            Priority: Blocker
> >
> > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 |
> > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
> > is not backwards compatible. The last section of the commit
> > {quote}
> > 4. Modified the Health Checker and redefined the meaning initial_interval_secs.
> > {quote}
> > has serious, unintended consequences.
> > Consider the following health check config:
> > {noformat}
> >   initial_interval_secs: 10
> >   interval_secs: 5
> >   max_consecutive_failures: 1
> > {noformat}
> > On the 0.16.0 executor, no health checking will occur for the first 10
> > seconds. Here the earliest a task can cause failure is at the 10th second.
> > On master, health checking starts right away which means the task can
> > fail at the first second since {{max_consecutive_failures}} is set to 1.
> > This is not backwards compatible and needs to be fixed.
> > I think a good solution would be to revert the meaning change to
> > initial_interval_secs and have the task transition into RUNNING when
> > {{max_consecutive_successes}} is met.
> > An investigation shows {{initial_interval_secs}} was set to 5 but the
> > task failed health checks right away:
> > {noformat}
> > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> > {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
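
For reference, the two readings of initial_interval_secs described in the quoted issue can be sketched as a check schedule. This is only a simplified model and not the executor's code: check_times and wait_before_first_check are hypothetical names, and checks are assumed to land on whole seconds counted from task start.

```python
def check_times(initial_interval_secs, interval_secs, horizon_secs,
                wait_before_first_check):
    """Return the seconds (from task start) at which health checks would run.

    Hypothetical helper for illustration only.
    wait_before_first_check=True models the 0.16.0 behavior described in the
    issue (no checks during the initial interval); False models the behavior
    on master (checks start right away).
    """
    start = initial_interval_secs if wait_before_first_check else 0
    return list(range(start, horizon_secs + 1, interval_secs))


# Config from the issue: initial_interval_secs=10, interval_secs=5.
old_schedule = check_times(10, 5, 20, wait_before_first_check=True)
new_schedule = check_times(10, 5, 20, wait_before_first_check=False)

print(old_schedule)  # [10, 15, 20]
print(new_schedule)  # [0, 5, 10, 15, 20]
```

Under this model, a task that is not yet healthy at startup is first probed at second 10 with the old meaning and at second 0 with the new one, which is the gap the issue calls out as not backwards compatible.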