Perhaps teaching the users about job arrays would help a lot in this situation. They could submit all 20-30k jobs with a single sbatch command. That is much more efficient for the scheduler and would probably eliminate almost all of the timeout issues you are seeing.
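
For example, the whole workload becomes one submission script, along the lines of this sketch (the script name, program name, array range, and %200 throttle are illustrative, not taken from your setup):

#!/bin/bash
#SBATCH --job-name=big_array
#SBATCH --ntasks=1
#SBATCH --array=0-29999%200   # 30,000 tasks, at most 200 running at once

# Each task selects its own input via the array index.
./process_one_input input.${SLURM_ARRAY_TASK_ID}

submitted once with "sbatch array_job.sh". Two caveats: MaxArraySize in slurm.conf (default 1001) would need to be raised to allow an array that large, and the %N throttle syntax requires 14.11 or later, which you have. Since 14.11, pending array tasks share a single job record, which is what takes the load off the scheduler and off squeue.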

On 07/14/15 08:42, Charles Johnson wrote:
slurm 14.11.7
cgroups implemented
backfill implemented

We have a small cluster -- ~650 nodes and ~6500 processors. We are looking for ways to lessen the impact of a busy scheduler on users who submit jobs through an automated submission process. Their job monitoring fails with:

squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

We are using backfill:

SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=2000,bf_max_job_user=100,max_sched_time=2

Our cluster generally runs numerous small, single-core jobs, and when a user submits 20,000 or 30,000 of them the system can fail to respond to squeue, or even sbatch.

One user has suggested we write wrappers for certain commands, like squeue, that automatically retry when such messages are returned. This doesn't seem like the appropriate "fix." IMHO, a better approach would be to "fix" the submission systems that some users have.
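
Something like this, I imagine (a sketch; the retry count and backoff times are arbitrary):

#!/bin/bash
# Hypothetical squeue wrapper: retry on the timeout errors shown above.
for attempt in 1 2 3 4 5; do
    if output=$(squeue "$@" 2>&1); then
        printf '%s\n' "$output"
        exit 0
    fi
    case "$output" in
        *"Socket timed out"*) sleep $((attempt * 10)) ;;  # back off, then retry
        *) printf '%s\n' "$output" >&2; exit 1 ;;         # other failure: give up
    esac
done
printf '%s\n' "$output" >&2
exit 1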

Are there others who have faced this issue? I have thought about caching the output of squeue in a file, refreshing the file periodically, and pointing a squeue wrapper at it; but again that doesn't seem like a good approach.
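
Concretely, I imagine something like this (hypothetical paths; the refresh interval is arbitrary):

#!/bin/bash
# Hypothetical wrapper placed ahead of the real squeue in users' PATH,
# serving output cached by a cron job on a login node, e.g.:
#   * * * * *  squeue --all > /var/cache/squeue.out.tmp && \
#              mv /var/cache/squeue.out.tmp /var/cache/squeue.out
# (writing to a tmp file and renaming keeps readers from seeing a half-written file)
cat /var/cache/squeue.out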

Any suggestions would be great.

Charles
