Hi Bill, thanks for the quick response. That's fair. I wonder if we could set a "start killing" threshold instead? For example, we set a "danger zone" limit so that any task in the danger zone is fair game to get killed. The closer a task gets to the max (or goes over it, of course), the more likely it is to be killed, up to "it absolutely will be killed right away." This would achieve our goal of reducing the likelihood of all shards getting killed at the same time, while preserving the resource exhaustion protection you describe.
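
To make the idea concrete, here's a rough sketch in Python. The names (danger_ratio, should_kill) and the linear ramp are just assumptions on my part to illustrate the shape of the policy, not anything Aurora does today:

    import random

    # Hypothetical knob: the fraction of a task's resource limit at which
    # the "danger zone" begins. Below this, the task is never killed for
    # this resource; at or above the hard max (1.0), the kill is certain.
    DANGER_RATIO = 0.9

    def kill_probability(usage, limit, danger_ratio=DANGER_RATIO):
        """Linear ramp from 0 at the danger threshold to 1 at the hard max."""
        ratio = usage / limit
        if ratio < danger_ratio:
            return 0.0
        if ratio >= 1.0:
            return 1.0
        return (ratio - danger_ratio) / (1.0 - danger_ratio)

    def should_kill(usage, limit):
        # Each task rolls independently, so simultaneous overages across a
        # job get thinned out gradually rather than killed all at once.
        return random.random() < kill_probability(usage, limit)

The linear ramp is just one option; the point is that because each task rolls the dice independently, a fleet-wide overage would be culled gradually instead of every shard dying in the same instant.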
Josh

On Wed, Oct 28, 2015 at 3:55 PM, Bill Farner <[email protected]> wrote:

> For some resources (like disk, or more acutely - RAM), there's not much we
> can do to provide assurances. Ultimately resource-driven task termination
> is managed at the node level, and may represent a real exhaustion of the
> resource. I'd be worried that trying to augment this might trade one
> problem for another - where the rationale for killing a task becomes
> non-deterministic, or even error-prone.
>
> On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <[email protected]> wrote:
>
>> Good afternoon all,
>>
>> Is it possible to tell the scheduler to throttle kill rates for a given
>> job? When all tasks in a job start consuming too much disk or ram because
>> of an unexpected service dependency meltdown it would be nice if we had a
>> little buffer time to triage the issue without the scheduler killing them
>> all en masse for using more than their allocated resources
>> simultaneously...
>>
>> Cheers,
>> Josh
