We will wait for now, then. But we look forward to the new version.

-Paul Edmon-

On 10/17/2013 1:56 PM, Danny Auble wrote:

You could install 13.12 on top of 2.6 today, but it isn't considered stable; the RPCs may change from now until it is officially tagged. Currently there is a Nov 15 cutoff for new code; until then, RPCs could change, which could make job loss possible. We would not recommend running any production system on 13.12 at this moment.

If you did want to fly by the seat of your pants, you should be able to update 2.6 to 13.12 today with no job loss, but any future version of 13.12 doesn't offer this guarantee. Only after official tags would this be considered safe.
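
If you do go that route, the usual order of operations looks roughly like the sketch below. This is a sketch only, assuming an RPM-based install; exact package names and init scripts vary by site:

    # upgrade the database daemon first
    /etc/init.d/slurmdbd stop
    rpm -Uvh slurm-slurmdbd-*.rpm
    /etc/init.d/slurmdbd start

    # then the controller, then slurmd on the compute nodes
    /etc/init.d/slurm stop
    rpm -Uvh slurm-*.rpm
    /etc/init.d/slurm start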

On 10/17/13 10:49, Paul Edmon wrote:

Is there an upgrade path from 2.6 to 13.12?

-Paul Edmon-

On 10/17/2013 1:34 PM, Danny Auble wrote:

There was a modification to the way lists were sorted in 13.12. This could make a massive difference in the speed of squeue on your system. You could backport the change to 2.6, but it took a few revisions and might not be that easy to backport fully.

On 10/17/13 10:30, Paul Edmon wrote:

# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
HealthCheckInterval=500
HealthCheckProgram=/usr/local/bin/node_monitor.sh
InactiveLimit=0
KillWait=30
MessageTimeout=100
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0

We have a lot of churn in our system. We are up to job ID 2.3 million, and that's after roughly two months, so we do about 40,000 jobs per day. It gets particularly bad when a ton of jobs are being submitted or exiting at the same time that the ctld is busy. Plus, if it hits a blocking portion of code, it can take a while. We don't typically see timeouts, but when we do it is due to high load on the ctld.

[root@holy-slurm01 slurm]# sdiag
*******************************************************
sdiag output at Thu Oct 17 13:26:07 2013
Data since      Thu Oct 17 05:17:09 2013
*******************************************************
Server thread count: 5
Agent queue size:    0

Jobs submitted: 7262
Jobs started:   25457
Jobs completed: 22316
Jobs canceled:  112
Jobs failed:    3

Main schedule statistics (microseconds):
        Last cycle:   536974
        Max cycle:    9358430
        Total cycles: 576
        Mean cycle:   3065748
        Mean depth cycle:  9151
        Cycles per minute: 1
        Last queue length: 4642

Backfilling stats (WARNING: data obtained in the middle of backfilling execution)
        Total backfilled jobs (since last slurm start): 16607
        Total backfilled jobs (since last stats cycle start): 16607
        Total cycles: 20
        Last cycle when: Thu Oct 17 13:13:38 2013
        Last cycle: 140076614
        Max cycle:  2722993429
        Mean cycle: 162420765
        Last depth cycle: 226
        Last depth cycle (try sched): 226
        Depth Mean: 852
        Depth Mean (try depth): 852
        Last queue length: 4790
        Queue length mean: 5498


On 10/17/2013 1:24 PM, Danny Auble wrote:

I am surprised you are still having timeouts. I would expect the longest anyone waits is 2 seconds. What is your MessageTimeout set to?

Danny

On 10/17/13 10:21, Paul Edmon wrote:

True. I was just contemplating ways to make it more responsive. Multiple copies of the data would do that; I just wasn't sure whether keeping them in sync would be a headache.

-Paul Edmon-

On 10/17/2013 1:01 PM, Moe Jette wrote:

Sending old data quickly seems very dangerous, especially if there are scripts submitting jobs and then running squeue to look for them.

Quoting Paul Edmon <[email protected]>:


Another way is to use the showq script that we have been working on:

https://github.com/fasrc/slurm_showq

That gives overall statistics as well. However, sdiag is a great way to see whether the system is running properly and to get a good view of it.

I will note that these sorts of queries tend to hang when the ctld is busy. We were discussing it in our group meeting yesterday. It might be good to have the ctld set up a thread dedicated to servicing these requests in a timely manner. When our users see timeout messages or slow responses, they freak out and send in help tickets. To me, diagnostic checks like this should respond in a timely manner, even if the data they contain is a little out of date.
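
As a stopgap on our side, something like a cron-driven snapshot could serve those slightly stale counts without touching the ctld on every user query. A minimal sketch, with purely hypothetical paths and interval:

    # cron entry: snapshot job states once a minute, renaming atomically
    * * * * * squeue -h -o '%T' > /tmp/squeue_states.new && mv /tmp/squeue_states.new /tmp/squeue_states

    # wrapper scripts then count from the snapshot instead of querying the ctld
    sort /tmp/squeue_states | uniq -c

The obvious catch is that anything reading the snapshot sees data up to a minute old.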

-Paul Edmon-

On 10/17/2013 11:57 AM, Moe Jette wrote:

There is no faster way to get job counts, but you might find the sdiag command helpful.

Quoting Damien François <[email protected]>:


Hello,

What is the most efficient way of finding out how many jobs are currently running, pending, etc. in the system?

At the moment I use squeue and wc -l, but that sometimes gets slow.
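
Concretely, something like this (the per-state breakdown just pipes squeue's %T state field through uniq):

    # total job count (-h suppresses the header line, so wc -l is accurate)
    squeue -h | wc -l

    # per-state counts (RUNNING, PENDING, ...) in one pass
    squeue -h -o '%T' | sort | uniq -c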

Is there a command/flag I might have missed that would output that information quickly?

Thanks

damien


