Hi,

Changing the SchedulerParameters made a big improvement.
One thing I've noticed is that if I try to cancel all running and pending jobs, the cancellation takes many hours to complete, and during this time everything starts giving "Socket timed out on send/recv operation" errors. Obviously cancelling all jobs isn't something that would be done in a production environment, but I'm curious why this happens.

Regards,
Andrew.

________________________________________
From: [email protected]
Sent: Wednesday, May 22, 2013 7:36 PM
To: slurm-dev
Subject: [slurm-dev] Re: Problems when using sched/backfill

Hi,

After increasing the log level I could see lots of messages like:

    backfill: completed yielding locks

Also, sdiag said that the backfilling cycle was taking around 160 seconds. I'll try changing the SchedulerParameters as suggested and see if this helps.

Thanks,
Andrew.

________________________________
From: Tim Carlson [[email protected]]
Sent: Wednesday, May 22, 2013 5:51 PM
To: slurm-dev
Subject: [slurm-dev] Re: Problems when using sched/backfill

We have a similar setup, and these are our current settings. Without tuning these, you are in a world of hurt with your job mix and doing backfill.

    SchedulerParameters=default_queue_depth=50,bf_interval=120,bf_window=300,bf_max_job_user=60

The bf_max_job_user is key for us.

On Tue, May 21, 2013 at 3:10 PM, Carles Fenoy <[email protected]> wrote:

Hi all,

Use sdiag to see if the backfilling is too slow. If it is, tune the scheduler parameters. There is a bf_max_jobs or something like this that will limit the number of jobs evaluated and will considerably decrease the scheduling time.

Regards,
Carles Fenoy
Barcelona Supercomputing Center

On 21/05/2013 23:15, "Bjørn-Helge Mevik" <[email protected]> wrote:

If you increase the log level, for instance set

    SlurmctldDebug=debug
    DebugFlags=Backfill

you might get more information about what happens.

If it is the backfilling that takes too long, you should see messages about backfill "yielding locks". If I recall correctly, the backfill scheduler used to time out after MessageTimeout/2 seconds, but looking at the code for 2.5.6 this seems to have changed.

Keep us posted about what you find. I'm planning to switch to 2.5.6 tomorrow, and have from time to time had problems getting the backfilling to be fast enough.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
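
Editorial note: the sdiag check recommended in this thread can be scripted. The following is a minimal Python sketch, not an official Slurm tool. It assumes that sdiag prints a section headed "Backfilling stats" containing "Last cycle:", "Max cycle:" and "Mean cycle:" lines with values in microseconds; the exact layout may differ between Slurm versions, so check it against your own output. The function name and the sample text are hypothetical and only illustrate the parsing.

```python
import re

# Hypothetical sample of sdiag output, used only for illustration.
# Assumption: cycle times are reported in microseconds.
SAMPLE = """\
Main schedule statistics (microseconds):
	Last cycle:   9000
Backfilling stats
	Total cycles: 12
	Last cycle when: Wed May 22 19:30:00 2013
	Last cycle: 160000000
	Max cycle:  172000000
	Mean cycle: 150000000
"""


def backfill_cycle_seconds(sdiag_text):
    """Return the last/max/mean backfill cycle times in seconds.

    Returns an empty dict if no "Backfilling stats" section is found.
    """
    # Ignore everything before the backfill header, so that the main
    # scheduler's "Last cycle" line is not picked up by mistake.
    _, header, section = sdiag_text.partition("Backfilling stats")
    if not header:
        return {}

    result = {}
    for key in ("Last", "Max", "Mean"):
        # Match e.g. "Last cycle: 160000000" but not "Last cycle when: ...".
        match = re.search(rf"^\s*{key} cycle:\s*(\d+)\s*$", section, re.MULTILINE)
        if match:
            result[key.lower()] = int(match.group(1)) / 1_000_000
    return result


if __name__ == "__main__":
    print(backfill_cycle_seconds(SAMPLE))
```

On a live system the text could be obtained with subprocess.run(["sdiag"], capture_output=True, text=True).stdout and passed to the same function. A last or mean cycle in the range of the 160 seconds reported above would suggest that the backfill parameters need tuning.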
