Charles,

Have you tried adjusting the "MessageTimeout" in slurm.conf?

We were intermittently seeing a lot of "Socket timed out" messages
from our frequent automated node and queue checks.  After we raised
the value to 90 seconds, most of those messages went away.
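
For reference, a minimal sketch of the change, assuming the stock
slurm.conf layout. MessageTimeout is the time allowed for a round-trip
communication to complete, in seconds, and the default is 10. The value
90 is simply what worked for us, not a recommendation for every site.

```
# slurm.conf (sketch): allow RPC round trips more time than the
# 10-second default before "Socket timed out" is reported
MessageTimeout=90
```

The new value has to be picked up by the daemons afterwards, for
example with "scontrol reconfigure" or a restart, depending on how
your site normally rolls out slurm.conf changes.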

As far as the automated submissions go, we haven't yet run into a
similar situation.  We did have a few users submitting jobs via
scripts, but we put them under a QOS with MaxCPUs and MaxSubmitJobs
limits to control their behavior.
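
Roughly, the QOS approach can be sketched as below. The QOS name, user
name, and limit values are hypothetical placeholders, not what we
actually run, and the option names are the ones I know from the 14.11
era sacctmgr; newer releases may spell them differently, so check the
sacctmgr man page for your version. The script only prints the
commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: create a restrictive QOS and attach it to one user.
# All names and numbers below are hypothetical placeholders.
QOS_NAME="${QOS_NAME:-scripted}"
TARGET_USER="${TARGET_USER:-someuser}"
MAX_CPUS="${MAX_CPUS:-64}"
MAX_SUBMIT="${MAX_SUBMIT:-500}"

# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the QOS with a CPU limit and a cap on submitted jobs.
run sacctmgr -i add qos "$QOS_NAME" MaxCPUs="$MAX_CPUS" MaxSubmitJobs="$MAX_SUBMIT"

# Restrict the user to that QOS (this replaces their QOS list).
run sacctmgr -i modify user where name="$TARGET_USER" set qos="$QOS_NAME"
```

The limits are only enforced if AccountingStorageEnforce in slurm.conf
includes the relevant options (limits and qos), and how the QOS ends
up being the one the user's jobs run under depends on your association
setup.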

John DeSantis



2015-07-14 11:42 GMT-04:00 Charles Johnson <[email protected]>:
> slurm 14.11.7
> cgroups implemented
> backfill implemented
>
> We have a small cluster -- ~650 nodes and ~6500 processors. We are looking
> for ways to lessen the impact of a busy scheduler for users who submit jobs
> with an automated submission process. Their job monitoring will fail with:
>
> squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
> slurm_load_jobs error: Socket timed out on send/recv operation
>
> We are using back-fill:
>
> SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=2000,bf_max_job_user=100,max_sched_time=2
>
> Our cluster generally has numerous small, single-core jobs; and when a user
> submits 20,000 or 30,000 jobs the system can fail to respond to squeue, or
> even sbatch.
>
> One user has suggested we write a wrapper for certain commands, like squeue,
> which auto re-try when such messages are returned. This doesn't seem like
> the appropriate "fix." IMHO, a better approach would be to "fix" the
> submission systems that some users have.
>
> Are there others who have faced this issue?  I have thought about caching the
> output to squeue in a file, refreshing the file in a timely way, and
> pointing an squeue wrapper to return that; but again that doesn't seem like
> a good approach.
>
> Any suggestions would be great.
>
> Charles
>
> --
> Charles Johnson, Vanderbilt University
> Advanced Computing Center for Research and Education
> 1231 18th Avenue South
> Hill Center, Suite 146
> Nashville, TN 37212
> Office: 615-936-8210
>
