It's 10K sbatches. I am going to play with the suggestions here and report back.
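For reference, the slurm.conf tuning discussed in this thread would look roughly like the fragment below. This is only a sketch: MessageTimeout and SchedulerParameters=defer come from the thread itself, while the SlurmctldDebug value is an illustrative choice for "make the debug level smaller" — pick values appropriate to your cluster.

```
# slurm.conf fragment (sketch of the tuning suggested in this thread)
MessageTimeout=60          # raise the RPC timeout so clients ride out a busy slurmctld
SchedulerParameters=defer  # don't attempt to schedule at every submission; batch scheduling work
SlurmctldDebug=info        # a lower debug level reduces slurmctld logging overhead
```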
On Wed, Feb 23, 2011 at 11:26 AM, Danny Auble <[email protected]> wrote:
> Hey Paul,
>
> Are you saying you are starting 1 job with 10k steps, or 10k sbatches?
>
> If 1 sbatch with 10k sruns, you should probably up your MessageTimeout to 30
> or 60; the problem here is that your slurmctld is probably getting overloaded.
> What is your debug level set to in the slurmctld log? If you want better
> performance you should make it smaller.
>
> If you are starting 10k sbatches, you should follow some of the ideas on this
> page:
>
> https://computing.llnl.gov/linux/slurm/high_throughput.html
>
> In particular,
>
> *SchedulerParameters=defer*
>
> Danny
>
>
> On 02/23/11 11:08, Paul Thirumalai wrote:
>> Hi,
>> I have about 350 single-CPU machines in my cluster. I need to run ~10K jobs
>> on these machines. I am submitting each job with the sbatch command below.
>> Each job takes about 90 seconds to complete.
>>
>> I run: /usr/bin/sbatch --begin=now <jobscript> <parameters>
>>
>> It seems that the first few thousand jobs are launched correctly, but once
>> in a while I see the following errors (for about 300 jobs out of 10K):
>>
>> srun: error: slurm_receive_msg: Socket timed out on send/recv operation
>> srun: error: Unable to confirm allocation for job 55286: Socket timed out
>> on send/recv operation
>> srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
>>
>> When I launch about 3000 jobs I don't see these errors, which leads me to
>> believe that the large volume of jobs is what is causing them. To alleviate
>> the issue, I tried batching the submissions: submit 500 jobs at a time,
>> sleep for 5 seconds, submit the next 500 jobs, and so on. This does not
>> really seem to help.
>>
>> Is there some way I could avoid these errors? Any help appreciated.
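The batching approach Paul describes (500 submissions, then a 5-second sleep) can be sketched as a small POSIX shell function. This is a hypothetical illustration, not code from the thread: `jobscript.sh` is a stand-in for the real job script, and the `SUBMIT` override exists only so the loop can be dry-run without a Slurm installation.

```shell
#!/bin/sh
# Throttled job submission sketch. SUBMIT defaults to the sbatch path
# used in the thread; override it (e.g. SUBMIT=echo) for a dry run.
submit_batches() {
    total=$1     # total number of jobs to submit
    batch=$2     # jobs per batch
    pause=$3     # seconds to sleep between batches
    submit="${SUBMIT:-/usr/bin/sbatch}"
    i=0
    while [ "$i" -lt "$total" ]; do
        # "jobscript.sh" and the numeric parameter are placeholders.
        "$submit" --begin=now jobscript.sh "$i"
        i=$((i + 1))
        # Pause between batches, but not after the final one.
        if [ $((i % batch)) -eq 0 ] && [ "$i" -lt "$total" ]; then
            sleep "$pause"
        fi
    done
}

# Example matching the thread: 10000 jobs, 500 per batch, 5 s pause.
# submit_batches 10000 500 5
```

Note that, per Danny's diagnosis, pacing the submissions alone may not help: if slurmctld is overloaded by scheduling work and verbose logging, raising MessageTimeout and setting SchedulerParameters=defer address the cause rather than the symptom.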
