> > How does rollback work in that case
Rollback behavior is unchanged when update pulses are enabled.

> > disable auto-rollback

That's also a feasible option.

On Wed, Nov 1, 2017 at 9:15 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Signal =
> - exit status from service
> - reason code from Mesos, if the task was killed by Mesos, e.g. revocable
>   core revoked during oversubscription
>
> Yes, I am aware of co-ordinated updates which allow this logic to be
> placed outside Aurora. How does rollback work in that case? Perhaps I
> should just disable auto-rollback in that case and put the rollback logic
> also into this external system.
>
> On Wed, Nov 1, 2017 at 8:39 AM, Bill Farner <wfar...@apache.org> wrote:
>
>>> Can Aurora distinguish between failures caused by the upgrade itself or
>>> other transient systemic issues
>>
>> There isn't any signal I know of that would allow Aurora to independently
>> determine the cause of task failures in a generic way.
>>
>> Two options come to mind:
>> 1. Human intervention - aurora update pause from the CLI
>> 2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L708-L714>,
>> and set up an in-house service to invoke pulseJobUpdate()
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L1134-L1139>.
>> This opts the job update into requiring periodic positive acknowledgement
>> from an external system that it is safe to proceed. You could use this,
>> for example, to automatically gate an update while a service has alerts
>> firing.
>>
>> On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Folks,
>>> Sometimes in our cluster upgrades start failing due to transient outages
>>> of dependencies or reasons unrelated to the new code being pushed out.
>>> Aurora hits its failure threshold and starts automatic rollback, which may
>>> make a bad condition worse (e.g. if the outage was related to load, rollback
>>> will increase load). Can Aurora distinguish between failures caused by the
>>> upgrade itself or other transient systemic issues (using e.g. reason code)?
>>> If not, does this make sense as a new feature?
>>>
>>> Mohit.
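
For what it's worth, the in-house pulse service mentioned in option 2 could be sketched roughly as below. This is only an illustration under assumptions, not Aurora code: run_pulse_loop, alerts_firing, send_pulse and update_active are hypothetical names, and send_pulse stands in for however you choose to invoke pulseJobUpdate() through your own client. The interval is assumed to be comfortably shorter than the job's blockIfNoPulsesAfterMs, so that withholding pulses while alerts fire is what gates the update.

```python
import time
from typing import Callable


def run_pulse_loop(
    alerts_firing: Callable[[], bool],
    send_pulse: Callable[[], None],
    update_active: Callable[[], bool],
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pulse an in-flight update only while the service looks healthy.

    All callables are hypothetical hooks supplied by the caller:
      alerts_firing -- True when monitoring says it is unsafe to proceed
      send_pulse    -- wraps the real pulseJobUpdate() call (assumption)
      update_active -- True while the job update is still in progress
    Returns the number of pulses sent.
    """
    pulses = 0
    while update_active():
        if not alerts_firing():
            # Positive acknowledgement: safe for the update to proceed.
            send_pulse()
            pulses += 1
        # Otherwise send nothing; once no pulse has arrived within
        # blockIfNoPulsesAfterMs the update is expected to block.
        sleep(interval_s)
    return pulses
```

Keeping the health check and the pulse call as injected callables means the gating policy can be tested without a scheduler, and the same loop could later host custom rollback logic if auto-rollback is disabled.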