There is some speedup information here: http://slurm.schedmd.com/high_throughput.html
We had big performance problems on RHEL6 with JobAcctGatherType=jobacct_gather/cgroup. If you use jobacct_gather/linux here, you can still use cgroups elsewhere. We've tested with 9,000 'nothing' (sleep 0) jobs submitted at 1,000/second and didn't see problems with sbatch (in fact, fork tended to fail before sbatch did).

Cheers,
Ben

-----Original Message-----
From: Charles Johnson [mailto:[email protected]]
Sent: 14 July 2015 16:42
To: slurm-dev
Subject: [slurm-dev] timeout issues

slurm 14.11.7
cgroups implemented
backfill implemented

We have a small cluster: ~650 nodes and ~6,500 processors. We are looking for ways to lessen the impact of a busy scheduler on users who submit jobs with an automated submission process. Their job monitoring fails with:

squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

We are using backfill:

SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=2000,bf_max_job_user=100,max_sched_time=2

Our cluster generally runs numerous small, single-core jobs, and when a user submits 20,000 or 30,000 of them the system can fail to respond to squeue, or even sbatch. One user has suggested we write a wrapper for certain commands, like squeue, that automatically retries when such messages are returned. This doesn't seem like the appropriate "fix." IMHO, a better approach would be to "fix" the submission systems that some users have. Are there others who have faced this issue?

I have thought about caching the output of squeue in a file, refreshing the file periodically, and pointing a squeue wrapper at it; but again, that doesn't seem like a good approach.

Any suggestions would be great.

Charles

--
Charles Johnson, Vanderbilt University
Advanced Computing Center for Research and Education
1231 18th Avenue South
Hill Center, Suite 146
Nashville, TN 37212
Office: 615-936-8210
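[For the archive: Ben's advice above — jobacct_gather/linux for accounting while keeping cgroups for other duties — could look like the slurm.conf fragment below. Only the JobAcctGatherType line comes from his reply; the ProctrackType and TaskPlugin lines are illustrative assumptions about which cgroup plugins a site might keep.]

```
# Gather accounting via /proc polling rather than the cgroup gatherer,
# which was the expensive path on RHEL6 in Ben's experience:
JobAcctGatherType=jobacct_gather/linux

# Cgroups can still be used elsewhere (assumed plugin choices, not from the post):
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
```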
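[For what it's worth, the retrying wrapper some users suggested can be quite small. A minimal sketch — the `retry` function name and the linear backoff are our own choices, not anything Slurm ships; on a real cluster the last line would wrap squeue:]

```shell
#!/bin/sh
# Sketch of a retry wrapper for commands that intermittently fail with
# "Socket timed out on send/recv operation" under scheduler load.

retry() {
    # retry MAX_TRIES command [args...]
    max=$1; shift
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        sleep "$n"            # linear backoff: 1s, 2s, 3s, ...
        n=$((n + 1))
    done
    return 0
}

# On a real cluster: retry 5 squeue -u "$USER"
```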

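[The caching idea Charles mentions can at least be made safe against partial reads by refreshing atomically. A sketch under assumed paths and names (nothing here is Slurm-provided): one periodic job calls `refresh squeue ...`, and the user-facing wrapper just cats the file, so thousands of user polls become one RPC to slurmctld.]

```shell
#!/bin/sh
# Sketch of a shared squeue cache. One periodic job queries the
# controller; everyone else reads a file. Cache path is an assumption.

CACHE=${SQUEUE_CACHE:-/tmp/squeue.cache}

refresh() {
    # Run the given command and atomically replace the cache, so a
    # concurrent reader never sees a half-written file.
    tmp=$(mktemp "${CACHE}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        mv "$tmp" "$CACHE"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Cron entry (illustrative): * * * * *  refresh squeue -o '%i %u %T'
# User wrapper:              cat "$CACHE"
```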