Andrej N. Gritsenko wrote:
Hello!
Sometime around Mon, Nov 7, 23:36, I received the following from Clay Teeter:
Here is my patch. Does this look ok, for now anyway?
And 20K jobs are now being pulled in around 1s!!!!!!
Thank you. With that patch applied to squeue, in our case with 20k
jobs in the queue, they are shown in about 15s, and most of that time is
spent in slurmctld, which goes to 99% CPU load while the list is produced.
Job submission also becomes too slow: once the queue has grown to 20k
jobs it accepts only about 3 jobs per second, while with an empty queue
it can accept tens of jobs per second. Unfortunately we don't have a
profiler there to find out which function consumes the CPU, so we have
no solution yet.
With best wishes.
Andriy.
Hi,
if you have not already done so, you should probably consider setting
SchedulerParameters=defer in the controller's slurm.conf. Without it,
every submission triggers a run of the scheduler logic, which
probably takes some time to work through the 20k jobs. With that option,
this per-submission cost should go away: only the internal scheduling
thread does the scheduling, every 30 seconds or so, and your jobs
should be submitted more quickly. One problem we experienced with
that option is that the -I parameter of srun did not work properly
in defer mode, so you can no longer use it. I do not know whether this
is still the case with later versions of Slurm.
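For reference, a minimal sketch of the setting, assuming no other
SchedulerParameters options are already set on your controller (if some
are, defer is appended to the same comma-separated list instead of
being added on a second line):

```
# slurm.conf on the controller (sketch; merge with any existing
# SchedulerParameters line rather than adding a duplicate)
SchedulerParameters=defer
```

The running slurmctld then has to reread its configuration, for
example with "scontrol reconfigure", and "scontrol show config" should
list the option afterwards.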
Regards,
Matthieu