Hello! Recently, while running some reliability tests, we restarted all the nodes in a cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to reschedule everything. We also had a leader change in the middle of the scheduling, which slowed it down even more. So we started looking at which Aurora parameters need more tuning.
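To put a rough number on it, here is a back-of-the-envelope sketch of the average throughput we saw. It uses only the approximate figures above (~3k tasks, ~1 hour), so it includes the time lost to the leader change and should be read as an estimate, not a measurement.

```python
# Approximate figures from the incident described above.
tasks = 3000                 # ~3k tasks in the cluster
elapsed_seconds = 60 * 60    # ~1 hour to reschedule everything

# Average scheduling rate over the whole window, leader change included.
tasks_per_second = tasks / elapsed_seconds
print(f"observed throughput: {tasks_per_second:.2f} tasks/s")
```

That works out to a bit under one task per second on average.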
max_tasks_per_schedule_attempt is currently set to its default value, which probably needs to be increased. Is there a rule of thumb for tuning it based on cluster size, number of jobs, number of frameworks, etc.? Regarding the JVM, we are running it with Xmx=24G, and so far we haven't seen pressure there. Any input on where to look would be really appreciated :)

Mauricio
