Aurora doesn't currently offer a way to do what you describe.

A job in the scheduler describes a provisioning goal (number of instances),
and we assume the scheduler shouldn't choose to modify that goal over
time.  To that end, the scheduler doesn't consider it a problem to
restart failed instances indefinitely; it assumes that the environment
will eventually self-heal.
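
If you do handle this in your own deployment scripts, one possible shape is
sketched below. This is only a minimal sketch, not an Aurora feature: the
function names, the job key, and the assumption that you have already
collected the failed instance ids yourselves are all hypothetical. It only
builds the command and does not run it. Please check `aurora job kill --help`
for the exact instance syntax your client version accepts before relying on
it.

```python
from typing import Iterable, List


def to_ranges(instance_ids: Iterable[int]) -> str:
    """Collapse instance ids into a compact spec such as '0-2,5'."""
    ids = sorted(set(instance_ids))
    parts: List[str] = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current run while the next id is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def kill_command(job_key: str, failed_ids: Iterable[int]) -> List[str]:
    """Build (but do not execute) a kill command for the failed instances.

    Assumes the client accepts 'cluster/role/env/job/<instances>'; verify
    this against your own client before use.
    """
    spec = to_ranges(failed_ids)
    if not spec:
        return []  # nothing failed, nothing to kill
    return ["aurora", "job", "kill", f"{job_key}/{spec}"]


if __name__ == "__main__":
    # Hypothetical job key and failed instance ids, for illustration only.
    print(" ".join(kill_command("devcluster/www-data/prod/hello", [7, 3, 4, 5])))
```

Keep in mind that killing those instances leaves the job running fewer
instances than its configuration asks for, so your scripts would need to
track that gap themselves.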


On Mon, Sep 18, 2017 at 5:13 PM, Kaiwen Xu <[email protected]> wrote:

> Hi,
>
> I am wondering if there is any way for Aurora to kill the failed
> instances when a job update is not successful (e.g. apps on some
> backends fail to start up etc.)?
>
> Right now we have turned off the "rollback" feature during job updates,
> because one or two backends (out of tens to hundreds) failing is
> acceptable for us, and we don't want to completely roll back the whole
> fleet due to that. However, it seems that with "rollback" off, those
> failed backends are just left there, and they try to restart
> indefinitely.
>
> Just curious what would be a recommended approach for this situation?
> Should we try to identify those instances and stop them in our own
> deployment scripts?
>
> Thanks,
> Kaiwen
>
