Aurora doesn't currently offer a way to do what you describe. A job in the scheduler describes a provisioning goal (a number of instances), and we assume the scheduler shouldn't change that goal on its own over time. Accordingly, the scheduler doesn't treat restarting failed instances indefinitely as a problem; it assumes the environment will eventually self-heal.
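
If you do go the route of handling this in your own deployment scripts, here is a rough sketch of what that could look like, not a recommended implementation. It assumes your script already has some way to count recent failures per instance (the `failure_counts` mapping and the `max_failures` threshold are hypothetical, as is the job key used below), and it only builds an `aurora job kill` command for the offending instances rather than running it.

```python
from typing import Dict, List


def flapping_instances(failure_counts: Dict[int, int], max_failures: int = 3) -> List[int]:
    """Return the instance IDs whose failure count exceeds the threshold.

    failure_counts is a hypothetical mapping of instance ID -> number of
    recent failures; how it gets collected is up to the deployment script.
    """
    return sorted(i for i, n in failure_counts.items() if n > max_failures)


def kill_command(job_key: str, instances: List[int]) -> List[str]:
    """Build (but do not run) an `aurora job kill` command for the instances.

    job_key has the form cluster/role/env/name. An empty list is returned
    when there is nothing to kill, so the caller can skip the call.
    """
    if not instances:
        return []
    spec = ",".join(str(i) for i in instances)
    return ["aurora", "job", "kill", f"{job_key}/{spec}"]


if __name__ == "__main__":
    # Made-up example data: instances 4 and 9 keep failing after the update.
    counts = {0: 0, 4: 7, 9: 12}
    cmd = kill_command("devcluster/www-data/prod/hello", flapping_instances(counts))
    print(" ".join(cmd))
```

The script would then pass the resulting list to something like `subprocess.run` once the update has finished. Keeping the selection logic separate from the command invocation makes it easy to dry-run against a real fleet first.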
On Mon, Sep 18, 2017 at 5:13 PM, Kaiwen Xu <[email protected]> wrote:

> Hi,
>
> I am wondering if there is any way for Aurora to kill the failed
> instances when a job update is not successful (e.g. apps on some
> backends fail to start up, etc.)?
>
> Right now, we have turned off the "rollback" feature during the job
> update, because one or two backends (out of tens to hundreds of
> backends) failing is acceptable for us, and we don't want to completely
> roll back the whole fleet due to that. However, it seems like with
> "rollback" off, those failed backends will just be left there, and
> they will try to restart infinitely.
>
> Just curious what would be a recommended approach for this situation?
> Should we try to identify those instances and stop them in our own
> deployment scripts?
>
> Thanks,
> Kaiwen
