Josh,

I think you are advocating for what is described in AURORA-279
<https://issues.apache.org/jira/browse/AURORA-279>, where health check
failures are reported to the scheduler so that the health checker does not
kill all tasks concurrently. I think this is the only case where Aurora can
plausibly throttle kills, since the other cases are resource exhaustion or
task failure (i.e., a process exited).

There is no design doc out yet, so no work has started on this effort. It
is a lot more complicated than it seems, because the following needs to be
done:
1. Create a reliable mechanism for the executor to relay information (a
health check failure) to the scheduler.
2. Define some sort of SLA/threshold that determines whether a health check
failure should result in a kill or not.
3. Modify the scheduler to act on the information from #1 and #2.
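
To make #2 and #3 a bit more concrete, here is a rough sketch of the kind
of check the scheduler could run before acting on a health check failure
relayed by the executor. Nothing like this exists in Aurora today; all of
the names and numbers below are made up for illustration:

def should_kill_on_health_failure(healthy_instances, total_instances,
                                  min_running_fraction=0.8):
    """Return True only if killing one more instance keeps the job SLA.

    healthy_instances:    instances of the job currently running/healthy
    total_instances:      total instances configured for the job
    min_running_fraction: SLA floor, e.g. keep >= 80% of the job running
    """
    remaining = healthy_instances - 1
    return remaining / total_instances >= min_running_fraction

The scheduler would kill the failing instance only when this returns True,
and otherwise defer or requeue the kill, so a job-wide health check failure
never takes the whole job down at once.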


On Wed, Oct 28, 2015 at 4:00 PM, Josh Adams <[email protected]> wrote:

> Hi Bill, thanks for the quick response.
>
> That's fair. I wonder if we could set a "start killing" threshold instead?
> For example, we set a "danger zone" limit so that any task that's in the
> danger zone is fair game to get killed. The closer it gets to the max (or
> goes over the max, of course), the more likely it is to get killed, up to
> "it absolutely will be killed right away." This would achieve our goal of
> reducing the likelihood of all shards getting killed at the same time, and
> preserve the resource exhaustion protection you describe.
>
> Josh
>
> On Wed, Oct 28, 2015 at 3:55 PM, Bill Farner <[email protected]> wrote:
>
>> For some resources (like disk, or more acutely - RAM), there's not much
>> we can do to provide assurances.  Ultimately resource-driven task
>> termination is managed at the node level, and may represent a real
>> exhaustion of the resource.  I'd be worried that trying to augment this
>> might trade one problem for another - where the rationale for killing a
>> task becomes non-deterministic, or even error-prone.
>>
>> On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <[email protected]> wrote:
>>
>>> Good afternoon all,
>>>
>>> Is it possible to tell the scheduler to throttle kill rates for a given
>>> job? When all tasks in a job start consuming too much disk or RAM because
>>> of an unexpected service dependency meltdown, it would be nice if we had a
>>> little buffer time to triage the issue without the scheduler killing them
>>> all en masse for using more than their allocated resources simultaneously...
>>>
>>> Cheers,
>>> Josh
>>>
>>
>>
>


-- 
Zameer Manji
