Hi
I have about 350 single-CPU machines in my cluster, and I need to run ~10K jobs
on them. I am submitting the jobs using the following sbatch
command. Each job takes about 90 seconds to complete.

I run /usr/bin/sbatch --begin=now <jobscript> <parameters>

It seems that the first few thousand jobs are launched correctly, but once
in a while I see the following errors (for about 300 of the 10K jobs):

srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 55286: Socket timed out on
send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

When I launch only about 3000 jobs I don't see these errors, which leads me
to believe that the large volume of jobs is what is causing them. To try to
alleviate the issue I submit in batches: submit 500 jobs at a time, sleep
for 5 seconds, then submit the next 500, and so on. This does not really
seem to help.
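For reference, the throttled submission I described can be sketched as the
loop below (the job script name and its parameter are placeholders for my
real submission command):

```shell
#!/bin/sh
# Sketch of batched submission: submit jobs in fixed-size batches,
# sleeping between batches so the controller is not flooded all at once.
# $1 = total jobs, $2 = batch size, $3 = seconds to sleep between batches,
# remaining args = the submit command (placeholder shown in the comment).
submit_in_batches() {
    total=$1; batch=$2; pause=$3; shift 3
    i=0
    while [ "$i" -lt "$total" ]; do
        j=0
        # submit one batch of up to $batch jobs
        while [ "$j" -lt "$batch" ] && [ "$i" -lt "$total" ]; do
            "$@" "$i"     # e.g. /usr/bin/sbatch --begin=now jobscript "$i"
            i=$((i + 1))
            j=$((j + 1))
        done
        # throttle: pause before the next batch
        [ "$i" -lt "$total" ] && sleep "$pause"
    done
    return 0
}

# Real usage (job script and parameters are placeholders):
# submit_in_batches 10000 500 5 /usr/bin/sbatch --begin=now jobscript
```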

Is there some way I could avoid these errors? Any help is appreciated.
