> > Can Aurora distinguish between failures caused by the upgrade itself or
> > other transient systemic issues?
There isn't any signal I know of that would allow Aurora to independently
determine the cause of task failures in a generic way. Two options come to
mind:

1. Human intervention - aurora update pause from the CLI.

2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs
<https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L708-L714>,
and set up an in-house service to invoke pulseJobUpdate()
<https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L1134-L1139>.
This opts the job update into requiring periodic positive acknowledgement
from an external system that it is safe to proceed. You could use this, for
example, to automatically gate an update while a service has alerts firing.

On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <[email protected]> wrote:

> Folks,
> Sometimes in our cluster upgrades start failing due to transient outages
> of dependencies or reasons unrelated to the new code being pushed out.
> Aurora hits its failure threshold and starts automatic rollback, which may
> make a bad condition worse (e.g. if the outage was related to load,
> rollback will increase load). Can Aurora distinguish between failures
> caused by the upgrade itself or other transient systemic issues (using
> e.g. a reason code)? If not, does this make sense as a new feature?
>
> Mohit.
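To make option 2 concrete, here is a minimal sketch of such a pulse service.
Only pulseJobUpdate() is the real Thrift RPC from api.thrift linked above;
everything else (the client object, the alerts_firing check, the interval
constant) is a placeholder you would wire up to your own Thrift client and
monitoring system, not anything Aurora ships.

```python
import time

# Hypothetical: choose an interval comfortably below the job's
# blockIfNoPulsesAfterMs so a single missed cycle doesn't block the update.
PULSE_INTERVAL_SECS = 30


def pulse_once(client, update_key, alerts_firing):
    """Send one pulse iff the service looks healthy.

    client        -- placeholder for your Aurora Thrift client
    update_key    -- the JobUpdateKey of the in-flight update
    alerts_firing -- callable returning True while alerts are active

    Returns True if a pulse was sent. When alerts are firing we simply
    withhold the pulse; once blockIfNoPulsesAfterMs elapses without one,
    Aurora blocks the update until pulses resume.
    """
    if alerts_firing():
        return False
    client.pulseJobUpdate(update_key)  # the real RPC from api.thrift
    return True


def run_pulser(client, update_key, alerts_firing, should_stop):
    """Loop forever (until should_stop()), pulsing on a fixed cadence."""
    while not should_stop():
        pulse_once(client, update_key, alerts_firing)
        time.sleep(PULSE_INTERVAL_SECS)
```

The key design point is that the external system only ever acknowledges
health; it never has to diagnose the failure. Withholding pulses degrades
safely into "pause the update", which is exactly the behavior wanted during
a transient outage.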
