Hello!

Recently, while running some reliability tests, we restarted all the nodes
in a cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to
reschedule everything. We also had a leader change in the middle of the
scheduling, which slowed it down even more. So we started looking at which
Aurora parameters need more tuning.

The value of max_tasks_per_schedule_attempt is currently set to the default,
which probably needs to be increased. Is there a rule of thumb for tuning it
based on cluster size, # of jobs, # of frameworks, etc.?

Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
pressure there.

Any input on where to look would be really appreciated :)

Mauricio
