Josh, I think you are advocating for something like AURORA-279 <https://issues.apache.org/jira/browse/AURORA-279>, where health check failures are sent to the scheduler to prevent the health checker from killing all tasks concurrently. I think this is the only case where Aurora can possibly do throttling, as the other cases are resource exhaustion or task failure (i.e., a process exited).
There is no design doc out yet, so no work has started on this effort. It is a lot more complicated than it seems, because the following things need to be done:

1. Create a reliable mechanism for the executor to relay information (health check failures) to the scheduler.
2. Define some sort of SLA/threshold that determines whether a health check failure should result in a kill.
3. Modify the scheduler to act on the information in #1 and #2.

On Wed, Oct 28, 2015 at 4:00 PM, Josh Adams <[email protected]> wrote:

> Hi Bill, thanks for the quick response.
>
> That's fair. I wonder if we could set a "start killing" threshold instead?
> For example, we set a "danger zone" limit so that any task in the danger
> zone is fair game to be killed. The closer it gets to the max (or over the
> max, of course), the more likely it is to be killed, up to "it absolutely
> will be killed right away." This would achieve our goal of reducing the
> likelihood of all shards getting killed at the same time, and preserve the
> resource exhaustion protection you describe.
>
> Josh
>
> On Wed, Oct 28, 2015 at 3:55 PM, Bill Farner <[email protected]> wrote:
>
>> For some resources (like disk, or more acutely, RAM), there's not much
>> we can do to provide assurances. Ultimately, resource-driven task
>> termination is managed at the node level and may represent a real
>> exhaustion of the resource. I'd be worried that trying to augment this
>> might trade one problem for another, where the rationale for killing a
>> task becomes non-deterministic or even error-prone.
>>
>> On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <[email protected]> wrote:
>>
>>> Good afternoon all,
>>>
>>> Is it possible to tell the scheduler to throttle kill rates for a given
>>> job?
>>> When all tasks in a job start consuming too much disk or RAM because
>>> of an unexpected service dependency meltdown, it would be nice if we had
>>> a little buffer time to triage the issue without the scheduler killing
>>> them all en masse for simultaneously using more than their allocated
>>> resources...
>>>
>>> Cheers,
>>> Josh

--
Zameer Manji
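For illustration only, Josh's "danger zone" idea from the thread above could be sketched roughly as follows. Nothing here is Aurora code; the function name, the soft/hard limits, and the linear kill probability are all hypothetical choices, assuming usage and limits are expressed in the same units (e.g. bytes of RAM):

```python
import random

# Hypothetical sketch of a "danger zone" kill policy: a task below the
# soft limit is never killed, a task at or above the hard limit is
# always killed, and in between the kill probability rises linearly,
# so it is unlikely that every shard is killed at the same instant.
def should_kill(usage, soft_limit, hard_limit, rng=random.random):
    if usage < soft_limit:
        return False
    if usage >= hard_limit:
        return True
    # Linear ramp from 0 at the soft limit to 1 at the hard limit.
    p = (usage - soft_limit) / (hard_limit - soft_limit)
    return rng() < p
```

A real implementation would also need the rate limiting discussed above, since an independent coin flip per task only reduces, not bounds, the number of simultaneous kills.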

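To make step 2 of the list above concrete, here is a minimal sketch of one possible SLA/threshold policy, assuming the executor can relay per-task health check results (step 1): a task becomes kill-eligible only after N consecutive failures, and at most K tasks per job may be killed at once. The class, its parameters, and the defaults are all illustrative, not an Aurora API:

```python
# Hypothetical SLA policy for health-check-driven kills: require several
# consecutive failures before a kill, and cap concurrent kills per job.
class HealthCheckSla:
    def __init__(self, max_consecutive_failures=3, max_concurrent_kills=1):
        self.max_consecutive_failures = max_consecutive_failures
        self.max_concurrent_kills = max_concurrent_kills
        self.failures = {}    # task_id -> consecutive failure count
        self.killing = set()  # task_ids currently being killed

    def record(self, task_id, healthy):
        """Record one health check result; return True if the scheduler
        should kill this task now."""
        if healthy:
            self.failures[task_id] = 0
            return False
        self.failures[task_id] = self.failures.get(task_id, 0) + 1
        if (self.failures[task_id] >= self.max_consecutive_failures
                and len(self.killing) < self.max_concurrent_kills):
            self.killing.add(task_id)
            return True
        return False
```

The concurrency cap is what distinguishes this from the pre-AURORA-279 behavior described in the thread, where each executor kills its own task independently and nothing prevents all shards from dying together.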