# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
HealthCheckInterval=500
HealthCheckProgram=/usr/local/bin/node_monitor.sh
InactiveLimit=0
KillWait=30
MessageTimeout=100
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
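As a side note, one way to double-check the values the running ctld actually has
loaded, as opposed to what happens to be in slurm.conf on disk, is something like
(the grep pattern is just a sketch covering the timers above):

  scontrol show config | grep -i -E 'MessageTimeout|SlurmctldTimeout|SlurmdTimeout'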
We have a lot of churn in our system. We are up to job ID 2.3 million,
and that's after roughly 2 months, so we do about 40,000 jobs per day.
It gets particularly bad when a ton of jobs are being submitted or
exiting at the same time the ctld is busy, and if it hits a blocking
portion of code it can take a while to respond. We don't typically see
timeouts, but when we do it is due to high load on the ctld.
[root@holy-slurm01 slurm]# sdiag
*******************************************************
sdiag output at Thu Oct 17 13:26:07 2013
Data since Thu Oct 17 05:17:09 2013
*******************************************************
Server thread count: 5
Agent queue size: 0
Jobs submitted: 7262
Jobs started: 25457
Jobs completed: 22316
Jobs canceled: 112
Jobs failed: 3
Main schedule statistics (microseconds):
Last cycle: 536974
Max cycle: 9358430
Total cycles: 576
Mean cycle: 3065748
Mean depth cycle: 9151
Cycles per minute: 1
Last queue length: 4642
Backfilling stats (WARNING: data obtained in the middle of backfilling execution)
Total backfilled jobs (since last slurm start): 16607
Total backfilled jobs (since last stats cycle start): 16607
Total cycles: 20
Last cycle when: Thu Oct 17 13:13:38 2013
Last cycle: 140076614
Max cycle: 2722993429
Mean cycle: 162420765
Last depth cycle: 226
Last depth cycle (try sched): 226
Depth Mean: 852
Depth Mean (try depth): 852
Last queue length: 4790
Queue length mean: 5498
On 10/17/2013 1:24 PM, Danny Auble wrote:
I am surprised you are still having timeouts. I would expect the
longest anyone waits is 2 seconds. What is your MessageTimeout set to?
Danny
On 10/17/13 10:21, Paul Edmon wrote:
True. I was just contemplating ways to make it more responsive.
Multiple copies of the data would do that; I just wasn't sure whether
keeping them in sync would be a headache.
-Paul Edmon-
On 10/17/2013 1:01 PM, Moe Jette wrote:
Sending old data quickly seems very dangerous, especially if there
are scripts submitting jobs and then running squeue to look for them.
Quoting Paul Edmon <[email protected]>:
Another way is to use the showq script that we have been working on:
https://github.com/fasrc/slurm_showq
That gives overall statistics as well. However, sdiag is a great way
to see whether the system is running properly and to get a good view
of it.
I will note that these sorts of queries tend to hang when the ctld
is busy. We were discussing it in our group meeting yesterday. It
might be good to have the ctld set up a thread dedicated to
servicing these requests in a timely manner. When our users see
timeout messages or slow responses, they freak out and send in help
tickets. To me, diagnostic checks like this should respond quickly,
even if the data they return is a little out of date.
-Paul Edmon-
On 10/17/2013 11:57 AM, Moe Jette wrote:
There is no faster way to get job counts, but you might find the
sdiag command helpful.
Quoting Damien François <[email protected]>:
Hello,
What is the most efficient way of finding how many jobs are currently
running, pending, etc. in the system?
At the moment I use squeue and wc -l, but that sometimes gets slow.
Is there a command or flag I might have missed that would output that
information quickly?
Thanks
damien
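For what it's worth, a rough sketch of trimming the squeue output so that
only a count comes back (assuming a squeue that supports the -h, -t and -o
options; exact behavior may vary by version):

  # count currently running and pending jobs
  squeue -h -t running | wc -l
  squeue -h -t pending | wc -l

  # or one pass that tallies every state
  squeue -h -o '%T' | sort | uniq -c

This still queries the ctld, so it will be just as slow as any other squeue
call when the controller is busy; it only trims what gets printed.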