[slurm-dev] Re: timeout issues

Bill Barth Tue, 14 Jul 2015 09:26:46 -0700

In addition, we try to get users who submit jobs at this scale to
bundle them into larger single jobs. If your prolog/epilog do any work
at all, submitting fewer SLURM jobs lowers the total overhead. You can
do this with SLURM job arrays or with other tools that launch
independent serial tasks in parallel.
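
A minimal sketch of the second option, assuming each task is an
independent command line and the whole bundle runs inside one SLURM
job. The names run_task and run_bundle are hypothetical, and the worker
count is taken from SLURM_CPUS_ON_NODE when that variable is set:

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(command):
    """Run one serial task (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def run_bundle(commands, workers=None):
    """Run independent commands in parallel; return exit codes in input order."""
    if workers is None:
        # Assumption: inside a job SLURM sets SLURM_CPUS_ON_NODE;
        # otherwise fall back to the local CPU count.
        workers = int(os.environ.get("SLURM_CPUS_ON_NODE", os.cpu_count() or 1))
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, commands))
```

A batch script would call run_bundle once with the full task list, so
the scheduler sees one job instead of thousands.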


Best,
Bill. 
-- 
Bill Barth, Ph.D., Director, HPC
[email protected]        |   Phone: (512) 232-7069
Office: ROC 1.435             |   Fax:   (512) 475-9445


On 7/14/15, 11:17 AM, "John Desantis" <[email protected]> wrote:

>
>Charles,
>
>Have you tried adjusting the "MessageTimeout" in slurm.conf?
>
>We were intermittently seeing many "Socket timed out" messages from
>frequent automated node and queue checks.  When we raised the value
>to 90 seconds, most of those messages went away.
>
>As far as the automated submissions go, we haven't yet run into a
>similar situation.   We did get a few users submitting jobs via
>scripts, but we targeted them using a QOS (MaxCPUs & MaxSubmitJobs) to
>control their behavior.
>
>John DeSantis
>
>
>
>2015-07-14 11:42 GMT-04:00 Charles Johnson
><[email protected]>:
>> slurm 14.11.7
>> cgroups implemented
>> backfill implemented
>>
>> We have a small cluster -- ~650 nodes and ~6500 processors. We are
>> looking for ways to lessen the impact of a busy scheduler for users
>> who submit jobs with an automated submission process. Their job
>> monitoring will fail with:
>>
>> squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
>> slurm_load_jobs error: Socket timed out on send/recv operation
>>
>> We are using back-fill:
>>
>> SchedulerParameters=bf_interval=120,bf_continue,bf_resolution=300,bf_max_job_test=2000,bf_max_job_user=100,max_sched_time=2
>>
>> Our cluster generally has numerous small, single-core jobs, and when
>> a user submits 20,000 or 30,000 jobs the system can fail to respond
>> to squeue, or even sbatch.
>>
>> One user has suggested we write a wrapper for certain commands, like
>> squeue, which automatically retries when such messages are returned.
>> This doesn't seem like the appropriate "fix." IMHO, a better approach
>> would be to "fix" the submission systems that some users have.
>>
>> Are there others who have faced this issue?  I have thought about
>> caching the output of squeue in a file, refreshing the file in a
>> timely way, and pointing an squeue wrapper to return that; but again
>> that doesn't seem like a good approach.
>>
>> Any suggestions would be great.
>>
>> Charles
>>
>> --
>> Charles Johnson, Vanderbilt University
>> Advanced Computing Center for Research and Education
>> 1231 18th Avenue South
>> Hill Center, Suite 146
>> Nashville, TN 37212
>> Office: 615-936-8210
>>
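
For reference, the retry wrapper suggested in the quoted message could
look roughly like the sketch below. It is only an illustration: the
function name run_with_retry and the attempts/delay defaults are
hypothetical, and it retries only when stderr contains the timeout text
quoted above, backing off so the retries do not add load to a busy
controller.

```python
import subprocess
import time

# Error text taken from the squeue output quoted in this thread.
TIMEOUT_TEXT = "Socket timed out on send/recv operation"


def run_with_retry(command, attempts=5, delay=2.0):
    """Run command; retry with a doubling delay only on a socket timeout.

    Assumes attempts >= 1. Returns the last CompletedProcess either way.
    """
    for attempt in range(attempts):
        result = subprocess.run(command, capture_output=True, text=True)
        if TIMEOUT_TEXT not in result.stderr:
            return result
        if attempt < attempts - 1:
            time.sleep(delay * 2 ** attempt)
    return result
```

Called as run_with_retry(["squeue", "-u", "someuser"]), it returns as
soon as a call completes without the timeout message. This hides the
symptom from the monitoring script; it does not reduce the load itself.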
