FYI - some of the end-to-end tests have become very flaky, probably
due to this commit. I am not sure whether others have seen this
behavior.

On Oct 12, 2016 12:51 AM, "David McLaughlin (JIRA)" <j...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567954#comment-15567954 ]
>
> David McLaughlin edited comment on AURORA-1791 at 10/12/16 7:50 AM:
> --------------------------------------------------------------------
>
> Given the lack of test coverage I've found just looking at a single
> function, I would seriously recommend we roll back the commit (or will it
> be commits?) rather than rush a patch in order to fix master. Any
> objections? cc/ [~zmanji] and [~joshua.cohen]
>
>
> was (Author: davmclau):
> Given the lack of test coverage I've found just looking at a single
> function, I would seriously recommend we roll back the commit (or will it
> be commits?) rather than rush a patch in order to fix master. Any
> objections? cc/ [~zmanji] and [~jcohen]
>
> > Commit ca683 is not backwards compatible.
> > -----------------------------------------
> >
> >                 Key: AURORA-1791
> >                 URL: https://issues.apache.org/jira/browse/AURORA-1791
> >             Project: Aurora
> >          Issue Type: Bug
> >            Reporter: Zameer Manji
> >            Assignee: Kai Huang
> >            Priority: Blocker
> >
> > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 |
> > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
> > is not backwards compatible. The last section of the commit
> > {quote}
> > 4. Modified the Health Checker and redefined the meaning
> > initial_interval_secs.
> > {quote}
> > has serious, unintended consequences.
> > Consider the following health check config:
> > {noformat}
> >       initial_interval_secs: 10
> >       interval_secs: 5
> >       max_consecutive_failures: 1
> > {noformat}
> > On the 0.16.0 executor, no health checking will occur for the first 10
> > seconds, so the earliest a task can fail is at the 10th second.
> > On master, health checking starts right away, which means the task can
> > fail at the first second since {{max_consecutive_failures}} is set to 1.
> > This is not backwards compatible and needs to be fixed.
> > I think a good solution would be to revert the meaning change to
> > {{initial_interval_secs}} and have the task transition into RUNNING when
> > {{max_consecutive_successes}} is met.
> > An investigation shows {{initial_interval_secs}} was set to 5 but the
> > task failed health checks right away:
> > {noformat}
> > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> > {noformat}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
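
To make the difference described in the quoted issue concrete, here is a
minimal sketch of the two timing behaviors. It is not the executor's
code: the function names and the delay_first_check flag are hypothetical,
the first check on master is modeled as running at t=0, and the failure
threshold follows the issue text (a task fails once consecutive failures
reach max_consecutive_failures), which may not match the real comparison
in health_checker.py.

```python
def first_check_time(initial_interval_secs, delay_first_check):
    """Seconds after task start at which the first health check runs.

    delay_first_check=True models the 0.16.0 behavior described in the
    issue (no checks during the initial interval); False models master
    (checks start right away, taken here as t=0).
    """
    return initial_interval_secs if delay_first_check else 0


def earliest_failure_time(initial_interval_secs, interval_secs,
                          max_consecutive_failures, delay_first_check):
    """Earliest time a task whose checks always fail is marked failed.

    ASSUMPTION: the task fails once consecutive failures reach
    max_consecutive_failures (at least one failed check is required).
    """
    failures_needed = max(1, max_consecutive_failures)
    start = first_check_time(initial_interval_secs, delay_first_check)
    # Checks after the first one are spaced interval_secs apart.
    return start + (failures_needed - 1) * interval_secs


# The health check config from the issue.
config = dict(initial_interval_secs=10, interval_secs=5,
              max_consecutive_failures=1)

old = earliest_failure_time(delay_first_check=True, **config)
new = earliest_failure_time(delay_first_check=False, **config)
print(old, new)  # 10 0
```

Under these assumptions the same config gives a failing task 10 seconds
of grace on 0.16.0 and none on master, which is the incompatibility the
issue reports.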
