Hi, I have about 350 single-CPU machines in my cluster, and I need to run ~10K jobs on them. I am submitting the jobs with the following sbatch command. Each job takes about 90 seconds to complete.
I run:

/usr/bin/sbatch --begin=now <jobscript> <parameters>

The first few thousand jobs are launched correctly, but once in a while (for about 300 jobs out of the 10K) I see the following errors:

srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 55286: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

When I launch only about 3,000 jobs I don't see these errors, which leads me to believe that the large volume of jobs is what causes them. To alleviate the issue I tried batching the submissions: submit 500 jobs, sleep for 5 seconds, submit the next 500, and so on. This does not really seem to help. Is there some way I could avoid these errors? Any help appreciated.
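For reference, the batched submission I tried looks roughly like the sketch below. This is a simplified stand-in, not my exact script: the function name, the parameter file read from stdin, and jobscript.sh (standing in for my actual jobscript) are all placeholders, and the submit command is overridable so the loop can be dry-run with echo.

```shell
#!/bin/bash
# Sketch of the batching approach: submit jobs in groups,
# pausing between groups to spread out the load on slurmctld.
# submit_in_batches BATCH_SIZE PAUSE_SECS [SUBMIT_CMD]
# reads one parameter set per line from stdin.
submit_in_batches() {
    local batch_size=${1:-500} pause=${2:-5} submit=${3:-/usr/bin/sbatch}
    local count=0 params
    while read -r params; do
        # jobscript.sh is a placeholder for my real jobscript.
        "$submit" --begin=now jobscript.sh $params
        count=$((count + 1))
        # After every full batch, sleep before submitting more.
        if (( count % batch_size == 0 )); then
            sleep "$pause"
        fi
    done
}
```

A dry run with echo in place of sbatch, e.g. printf '1\n2\n3\n' | submit_in_batches 2 0 echo, shows the submission lines without touching the scheduler.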
