There is some speedup information here: http://slurm.schedmd.com/high_throughput.html
We had big performance problems on RHEL6 with JobAcctGatherType=jobacct_gather/cgroup. If you use jobacct_gather/linux here, you can still use cgroups elsewhere. We've tested with 9,000 'nothing' (sleep 0) jobs submitted at 1,000/second and didn't see problems with sbatch (in fact, fork tended to fail before sbatch did).

Cheers,
Ben

-----Original Message-----
From: Charles Johnson [mailto:[email protected]]
Sent: 14 July 2015 16:42
To: slurm-dev
Subject: [slurm-dev] timeout issues

slurm 14.11.7
cgroups implemented
backfill implemented

We have a small cluster: ~650 nodes and ~6,500 processors. We are looking for ways to lessen the impact of a busy scheduler on users who submit jobs with an automated submission process. Their job monitoring fails with:

squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

We are using backfill:

SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=2000,bf_max_job_user=100,max_sched_time=2

Our cluster generally runs numerous small, single-core jobs, and when a user submits 20,000 or 30,000 of them the system can fail to respond to squeue, or even sbatch. One user has suggested we write a wrapper for certain commands, like squeue, that automatically retries when such messages are returned. This doesn't seem like the appropriate "fix." IMHO, a better approach would be to "fix" the submission systems that some users have. Are there others who have faced this issue?

I have thought about caching the output of squeue in a file, refreshing the file periodically, and pointing a squeue wrapper at it; but again, that doesn't seem like a good approach.

Any suggestions would be great.

Charles

--
Charles Johnson, Vanderbilt University
Advanced Computing Center for Research and Education
1231 18th Avenue South
Hill Center, Suite 146
Nashville, TN 37212
Office: 615-936-8210
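[For the archive: Ben's advice above — jobacct_gather/linux for accounting while keeping cgroups for other duties — could look like the slurm.conf fragment below. Only the JobAcctGatherType line comes from his reply; the ProctrackType and TaskPlugin lines are illustrative assumptions about which cgroup plugins a site might keep.]

```
# Gather accounting via /proc polling rather than the cgroup gatherer,
# which was the expensive path on RHEL6 in Ben's experience:
JobAcctGatherType=jobacct_gather/linux

# Cgroups can still be used elsewhere (assumed plugin choices, not from the post):
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
```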
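[For what it's worth, the retrying wrapper some users suggested can be quite small. A minimal sketch — the `retry` function name and the linear backoff are our own choices, not anything Slurm ships; on a real cluster the last line would wrap squeue:]

```shell
#!/bin/sh
# Sketch of a retry wrapper for commands that intermittently fail with
# "Socket timed out on send/recv operation" under scheduler load.

retry() {
    # retry MAX_TRIES command [args...]
    max=$1; shift
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        sleep "$n"            # linear backoff: 1s, 2s, 3s, ...
        n=$((n + 1))
    done
    return 0
}

# On a real cluster: retry 5 squeue -u "$USER"
```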

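[The caching idea Charles mentions can at least be made safe against partial reads by refreshing atomically. A sketch under assumed paths and names (nothing here is Slurm-provided): one periodic job calls `refresh squeue ...`, and the user-facing wrapper just cats the file, so thousands of user polls become one RPC to slurmctld.]

```shell
#!/bin/sh
# Sketch of a shared squeue cache. One periodic job queries the
# controller; everyone else reads a file. Cache path is an assumption.

CACHE=${SQUEUE_CACHE:-/tmp/squeue.cache}

refresh() {
    # Run the given command and atomically replace the cache, so a
    # concurrent reader never sees a half-written file.
    tmp=$(mktemp "${CACHE}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        mv "$tmp" "$CACHE"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Cron entry (illustrative): * * * * *  refresh squeue -o '%i %u %T'
# User wrapper:              cat "$CACHE"
```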