That works out to scheduling about 1 task/sec, which is at least one order
of magnitude lower than i would expect.  Are you sure tasks were scheduling
and continuing to run, rather than exiting/failing and triggering more
scheduling work?

What build is this from?  Can you share (scrubbed) scheduler logs from this
period?

On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hello!
>
> Recently, running some reliability tests, we restarted all the nodes in a
> cluster of ~300 hosts and 3k tasks. Aurora took about 1hour to reschedule
> everything, we have a change of leader in the middle of the scheduling and
> that slowed it down even more. So we started looking which aurora
> parameters needed more tuning.
>
> The value of max_tasks_per_schedule_attempt is set to the default now,
> that probably needs to be increased, is there a rule of thumb to tune it
> based on cluster size, # of jobs, # of frameworks, etc?
>
> Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
> pressure there.
>
> Any input on where to look at would be really appreciated :)
>
> Mauricio
>
>
>
>
>

Reply via email to