Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

Paul Edmon Mon, 26 Aug 2019 07:14:54 -0700

We've hit this before due to RPC saturation. I highly recommend using max_rpc_cnt and/or defer for scheduling. That should help alleviate this problem.
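
For reference, both options are set through SchedulerParameters in slurm.conf. A minimal sketch follows; the value 150 is only an illustrative assumption and would need tuning for each site:

```
# slurm.conf (fragment). The threshold below is an assumed example value, not a recommendation.
SchedulerParameters=defer,max_rpc_cnt=150
```

With defer, slurmctld does not try to schedule each job individually at submit time. With max_rpc_cnt, it defers scheduling while the number of active slurmctld threads is at or above the threshold. The change applies once the configuration is reloaded, for example with scontrol reconfigure.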


-Paul Edmon-


On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:

Hello,
I wrote a regression-testing toolkit to manage large numbers of SLURM jobs and their output (the toolkit can be found here <https://github.com/gobbedy/slurm_simulation_toolkit/> if anyone is interested).
To make job launching faster, sbatch commands are forked, so that numerous jobs may be submitted in parallel.
We (the cluster admin and I) are concerned that this may cause unresponsiveness for other users.
I cannot say for sure since I don't have visibility over all users of the cluster, but unresponsiveness doesn't seem to have occurred so far. That being said, the fact that it hasn't occurred yet doesn't mean it won't in the future. So I'm treating this as a ticking time bomb to be fixed ASAP.
My questions are the following:
1) Does anyone have experience with large numbers of jobs submitted in parallel? What are the limits that can be hit? For example, is there some hard limit on how many jobs a SLURM scheduler can handle before blacking out / slowing down?
2) Is there a way for me to find/measure/ping this resource limit?
3) How can I make sure I don't hit this resource limit?
From what I've observed, parallel submission can improve submission time by a factor of at least 10. This can make a big difference in users' workflows.
For that reason, I would like to keep the option of launching jobs sequentially as a last resort.
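
One possible middle ground between fully forked and fully sequential submission is to cap how many sbatch calls are in flight at once and to back off when a submission fails. The following is only a sketch under stated assumptions, not how the toolkit works: the names submit_one, submit_all, max_parallel, submit_cmd, retries and backoff_s are hypothetical, and the default values are arbitrary.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def submit_one(script, submit_cmd=("sbatch",), retries=3, backoff_s=1.0):
    """Submit one job script, retrying with exponential backoff on failure.

    submit_cmd is injectable (an assumption of this sketch) so the logic
    can be exercised without a Slurm installation.
    """
    for attempt in range(retries):
        result = subprocess.run(
            [*submit_cmd, script], capture_output=True, text=True
        )
        if result.returncode == 0:
            return result.stdout.strip()
        # Wait before the next attempt, but not after the final one.
        if attempt < retries - 1:
            time.sleep(backoff_s * 2 ** attempt)
    raise RuntimeError(
        f"submission failed for {script}: {result.stderr.strip()}"
    )


def submit_all(scripts, max_parallel=4, **kwargs):
    """Submit scripts with at most max_parallel submissions in flight.

    Results are returned in the same order as the input scripts.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda s: submit_one(s, **kwargs), scripts))
```

Calling submit_all(["job_a.sh", "job_b.sh"], max_parallel=4) would run at most four sbatch processes at a time. A suitable cap depends on the cluster and is best agreed with the admin.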
Thanks in advance.

Regards,
Guillaume.
