> > Can Aurora distinguish between failures caused by the upgrade itself or
> > other transient systemic issues?
There isn't any signal I know of that would allow Aurora to independently
determine the cause of task failures in a generic way. Two options come to
mind:

1. Human intervention - aurora update pause from the CLI.

2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs
<https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L708-L714>,
and set up an in-house service to invoke pulseJobUpdate()
<https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L1134-L1139>.
This opts the job update into requiring periodic positive acknowledgement
from an external system that it is safe to proceed. You could use this, for
example, to automatically gate an update while a service has alerts firing.

On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <[email protected]> wrote:

> Folks,
> Sometimes in our cluster upgrades start failing due to transient outages
> of dependencies or reasons unrelated to the new code being pushed out.
> Aurora hits its failure threshold and starts automatic rollback, which may
> make a bad condition worse (e.g. if the outage was related to load,
> rollback will increase load). Can Aurora distinguish between failures
> caused by the upgrade itself or other transient systemic issues (using
> e.g. a reason code)? If not, does this make sense as a new feature?
>
> Mohit.
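To make option 2 concrete, here is a minimal sketch of such a pulse service.
Only pulseJobUpdate() is the real Thrift RPC from api.thrift linked above;
everything else (the client object, the alerts_firing check, the interval
constant) is a placeholder you would wire up to your own Thrift client and
monitoring system, not anything Aurora ships.

```python
import time

# Hypothetical: choose an interval comfortably below the job's
# blockIfNoPulsesAfterMs so a single missed cycle doesn't block the update.
PULSE_INTERVAL_SECS = 30


def pulse_once(client, update_key, alerts_firing):
    """Send one pulse iff the service looks healthy.

    client        -- placeholder for your Aurora Thrift client
    update_key    -- the JobUpdateKey of the in-flight update
    alerts_firing -- callable returning True while alerts are active

    Returns True if a pulse was sent. When alerts are firing we simply
    withhold the pulse; once blockIfNoPulsesAfterMs elapses without one,
    Aurora blocks the update until pulses resume.
    """
    if alerts_firing():
        return False
    client.pulseJobUpdate(update_key)  # the real RPC from api.thrift
    return True


def run_pulser(client, update_key, alerts_firing, should_stop):
    """Loop forever (until should_stop()), pulsing on a fixed cadence."""
    while not should_stop():
        pulse_once(client, update_key, alerts_firing)
        time.sleep(PULSE_INTERVAL_SECS)
```

The key design point is that the external system only ever acknowledges
health; it never has to diagnose the failure. Withholding pulses degrades
safely into "pause the update", which is exactly the behavior wanted during
a transient outage.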
