It's 10K sbatches. I am going to play with the suggestions here and report
back.
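For reference, the batching workaround described in the quoted message below can be sketched roughly like this. This is a minimal illustration only: the jobscript name is a placeholder, the batch size and sleep interval are the values Paul mentions, and `submit_in_batches` is a hypothetical helper, not something from Slurm itself.

```python
import subprocess
import time

def chunks(seq, size):
    """Split a sequence into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def submit_in_batches(params_list, jobscript="job.sh", batch_size=500, pause=5):
    """Submit one sbatch per parameter set, throttled in batches.

    Sleeps between batches so slurmctld is not flooded with 10K
    near-simultaneous submissions. All names here are illustrative.
    """
    for batch in chunks(params_list, batch_size):
        for params in batch:
            subprocess.run(
                ["/usr/bin/sbatch", "--begin=now", jobscript, params],
                check=True,
            )
        time.sleep(pause)  # throttle between batches of submissions
```

As Paul notes, throttling on the client side alone did not help much in his case; the server-side tuning Danny suggests is the other half of the picture.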

On Wed, Feb 23, 2011 at 11:26 AM, Danny Auble <[email protected]> wrote:

>  Hey Paul,
>
> Are you saying you are starting 1 job with 10k steps, or 10k sbatches?
>
> If it is 1 sbatch with 10k sruns, you should probably raise your
> MessageTimeout to 30 or 60; the problem here is that your slurmctld is
> probably getting overloaded.  What is your debug level set to in the
> slurmctld log?  If you want better performance you should lower it.
>
> If you are starting 10k sbatches, you should follow some of the ideas on
> this page...
>
> https://computing.llnl.gov/linux/slurm/high_throughput.html
>
> In particular,
>
> *SchedulerParameters=defer*
>
> Danny
>
>
>
> On 02/23/11 11:08, Paul Thirumalai wrote:
>
> Hi
> I have about 350 single-CPU machines in my cluster. I need to run ~10K jobs
> on these machines. I am submitting the jobs using the following sbatch
> command. Each job takes about 90 seconds to complete.
>
>  I run /usr/bin/sbatch --begin=now <jobscript> <parameters>
>
>  It seems that the first few thousand jobs are launched correctly, but
> once in a while I see the following errors (for about 300 jobs out of 10K):
>
>   srun: error: slurm_receive_msg: Socket timed out on send/recv operation
> srun: error: Unable to confirm allocation for job 55286: Socket timed out
> on send/recv operation
> srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
>
>  When I launch about 3000 jobs I don't see these errors, which would lead
> me to believe that the large volume of jobs is what is causing them. To
> alleviate this, I have tried batching the submissions, i.e., submitting 500
> jobs at a time, sleeping for 5 seconds, submitting the next 500 jobs, and
> so on. This does not really seem to help.
>
>  Is there some way I could avoid these errors? Any help is appreciated.
>
>
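Summarizing the quoted advice: the two settings Danny points to live in slurm.conf. A hedged fragment might look like the following; the exact timeout value is illustrative, and other high-throughput tunings from the linked page may also apply.

```conf
# slurm.conf (fragment) -- values are illustrative, not prescriptive
MessageTimeout=60            # raise from the 10-second default to ride out slurmctld load
SchedulerParameters=defer    # defer scheduling attempts instead of scheduling at each submission
```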
