Hey Paul,

Are you saying you are starting 1 job with 10k steps, or submitting 10k separate sbatch jobs?

If it is 1 sbatch with 10k sruns, you should probably raise your MessageTimeout to 30 or 
60 seconds; the problem here is that your slurmctld is probably getting overloaded.  What 
debug level is set in the slurmctld log?  If you want better performance 
you should lower the debug level, since less logging means less overhead.
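For reference, the relevant slurm.conf lines might look like the following. These are example values, not tuned recommendations, and on a 2011-era Slurm the debug level is numeric (3 roughly corresponds to info):

```
# slurm.conf (example values)
MessageTimeout=60   # default is 10 seconds; gives clients more patience
                    # before timing out against a busy slurmctld
SlurmctldDebug=3    # lower verbosity means less logging overhead in slurmctld
```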

If you are submitting 10k separate sbatch jobs, you should follow some of the ideas on 
this page...

https://computing.llnl.gov/linux/slurm/high_throughput.html

In particular,

*SchedulerParameters=defer*
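For reference, that setting goes in slurm.conf:

```
# slurm.conf: don't try to schedule each job immediately at submit time;
# defer and schedule jobs in batches instead, reducing load on slurmctld
SchedulerParameters=defer
```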

Danny


On 02/23/11 11:08, Paul Thirumalai wrote:
Hi
I have about 350 single-CPU machines in my cluster. I need to run ~10K jobs on 
these machines. I am submitting the jobs using the following sbatch command. 
Each job takes about 90 seconds to complete.

I run /usr/bin/sbatch --begin=now <jobscript> <parameters>

It seems that the first few thousand jobs are launched correctly, but once in a 
while I see the following errors (for about 300 of the 10K jobs):

srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 55286: Socket timed out on 
send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

When I launch about 3000 jobs I don't see these errors, which leads me to 
believe that the large volume of jobs is what is causing them. To alleviate the 
issue I have tried submitting in batches, i.e. submitting 500 jobs at a time, 
sleeping for 5 seconds, submitting the next 500, and so on. 
This has not really seemed to help.
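Scaled down, the batching approach looks something like this (the script name "jobscript.sh" and the small counts are hypothetical stand-ins; the real run would use total=10000, batch=500, pause=5):

```shell
#!/bin/sh
# Sketch of the batched-submission workaround: submit a batch of jobs,
# sleep between batches, repeat until all jobs are in.
SUBMIT=/usr/bin/sbatch
[ -x "$SUBMIT" ] || SUBMIT=echo    # fall back to a dry run where sbatch is absent

total=20
batch=5
pause=1
submitted=0
while [ "$submitted" -lt "$total" ]; do
    n=0
    while [ "$n" -lt "$batch" ] && [ "$submitted" -lt "$total" ]; do
        "$SUBMIT" --begin=now jobscript.sh "$submitted" >/dev/null
        submitted=$((submitted + 1))
        n=$((n + 1))
    done
    sleep "$pause"    # pause between batches to let slurmctld drain its queue
done
echo "submitted $submitted jobs"
```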

Is there some way I could avoid these errors? Any help is appreciated.
